Should I shard the dataset in distributed training?

As the title says, should I shard the dataset by rank?
for example:

rank_dataset = dataset.shard(num_shards=training_args.world_size, index=training_args.process_index)

Or will the Trainer do that automatically?
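
To make it concrete, this is roughly the choice I mean (just a sketch; the model and dataset are placeholders, and I'm assuming the world_size / process_index attributes of TrainingArguments):

# Sketch of the two options (placeholder model and dataset; world_size and
# process_index are read from TrainingArguments).
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="out")
dataset = load_dataset("imdb", split="train")  # placeholder dataset

# Option A: shard manually, one slice per process.
rank_dataset = dataset.shard(
    num_shards=training_args.world_size,
    index=training_args.process_index,
)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Option B: pass the full dataset and rely on the Trainer to split it.
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)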


Same question, but without using the Trainer.
For example, I'm using DistributedDataParallel to wrap a model but I'm not training it, just processing the dataset with load_from_disk and Dataset.map.
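
Roughly what I'm doing (a minimal sketch; the path and the map function are placeholders, and I'm assuming the script is launched with torchrun so torch.distributed picks up RANK and WORLD_SIZE from the environment):

# Sketch of manual per-rank processing (placeholder path and map function;
# assumes torchrun sets RANK / WORLD_SIZE in the environment).
import torch.distributed as dist
from datasets import load_from_disk

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()

dataset = load_from_disk("/path/to/dataset")  # placeholder path

# Shard manually so each process only maps its own slice.
rank_dataset = dataset.shard(num_shards=world_size, index=rank)
rank_dataset = rank_dataset.map(lambda example: example)  # placeholder preprocessing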

The Trainer does the sharding for you. Same if you use Accelerate with your own training loop.
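
For example, with Accelerate the DataLoader you pass through accelerator.prepare() is already split across processes, so there is no need to call Dataset.shard yourself (rough sketch; the dataset is a placeholder):

# Rough sketch: Accelerate shards the batches across processes once the
# DataLoader has gone through accelerator.prepare(); no manual Dataset.shard.
from accelerate import Accelerator
from datasets import load_dataset
from torch.utils.data import DataLoader

accelerator = Accelerator()

dataset = load_dataset("imdb", split="train")  # placeholder dataset
dataloader = DataLoader(dataset, batch_size=8)

# After prepare(), each process only sees its own portion of the data.
dataloader = accelerator.prepare(dataloader)

for batch in dataloader:
    ...  # your training or processing step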
