Should I shard the dataset in distributed training?

As the title says, should I shard the dataset by rank?
for example:

rank_dataset = dataset.shard(num_shards=training_args.world_size, index=training_args.process_index)

Or will the Trainer do that automatically?
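
To make it concrete, this is roughly the choice I mean (just a sketch; the model and dataset are placeholders, and I'm assuming the world_size / process_index attributes of TrainingArguments):

# Sketch of the two options (placeholder model and dataset; world_size and
# process_index are read from TrainingArguments).
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="out")
dataset = load_dataset("imdb", split="train")  # placeholder dataset

# Option A: shard manually, one slice per process.
rank_dataset = dataset.shard(
    num_shards=training_args.world_size,
    index=training_args.process_index,
)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Option B: pass the full dataset and rely on the Trainer to split it.
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)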


Same question, but without using the Trainer.
For example, I'm using DistributedDataParallel to wrap a model but I'm not training it, just processing the dataset with load_from_disk and Dataset.map.
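
Roughly what I'm doing (a minimal sketch; the path and the map function are placeholders, and I'm assuming the script is launched with torchrun so torch.distributed picks up RANK and WORLD_SIZE from the environment):

# Sketch of manual per-rank processing (placeholder path and map function;
# assumes torchrun sets RANK / WORLD_SIZE in the environment).
import torch.distributed as dist
from datasets import load_from_disk

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()

dataset = load_from_disk("/path/to/dataset")  # placeholder path

# Shard manually so each process only maps its own slice.
rank_dataset = dataset.shard(num_shards=world_size, index=rank)
rank_dataset = rank_dataset.map(lambda example: example)  # placeholder preprocessing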

The Trainer does the sharding for you. Same if you use Accelerate with your own training loop.
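
For example, with Accelerate the DataLoader you pass through accelerator.prepare() is already split across processes, so there is no need to call Dataset.shard yourself (rough sketch; the dataset is a placeholder):

# Rough sketch: Accelerate shards the batches across processes once the
# DataLoader has gone through accelerator.prepare(); no manual Dataset.shard.
from accelerate import Accelerator
from datasets import load_dataset
from torch.utils.data import DataLoader

accelerator = Accelerator()

dataset = load_dataset("imdb", split="train")  # placeholder dataset
dataloader = DataLoader(dataset, batch_size=8)

# After prepare(), each process only sees its own portion of the data.
dataloader = accelerator.prepare(dataloader)

for batch in dataloader:
    ...  # your training or processing step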
