As the title says, should I shard the dataset by rank myself?
For example:
rank_dataset = dataset.shard(num_shards=training_args.world_size, index=training_args.rank)
Or will the Trainer do that automatically?
Same question, but without using Trainer.
For example, if I’m using DistributedDataParallel to wrap a model but I’m not training it, just processing the dataset with load_from_disk and Dataset.map.
The Trainer does the sharding for you. The same is true if you use Accelerate with your own training loop.