How to Ensure Each Process Reads Its Own Dataset and Trains Correctly When Using Trainer?

I’m using the Hugging Face Trainer for training, but my dataset is quite large, so I want each process to read only its own shard of the dataset (I know this can be done with split_dataset_by_node or by handling rank and world size manually). However, I noticed that when the Trainer calls accelerate’s prepare(), it wraps my DataLoader, so the data still gets sharded by rank on top of my manual split. How can I resolve this issue?
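
For reference, here is a minimal sketch of the setup I mean (the data files, gpt2 model, and hyperparameters are just placeholders; launched with `torchrun --nproc_per_node=N train.py`):

```python
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# RANK / WORLD_SIZE are set by torchrun.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Stream the large dataset and keep only this process's shard.
dataset = load_dataset("json", data_files="data/*.jsonl", streaming=True)["train"]
dataset = split_dataset_by_node(dataset, rank=rank, world_size=world_size)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

model = AutoModelForCausalLM.from_pretrained("gpt2")

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=8,
        max_steps=100,  # required for a streaming dataset of unknown length
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# The Trainer builds the DataLoader internally and passes it through
# accelerator.prepare(), which is where the per-rank wrapping I want to
# avoid seems to happen.
trainer.train()
```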
Thank you in advance for your help!
