How to handle streaming datasets with DDP?

Let’s say I have a dataset with 5 samples with values [1, 2, 3, 4, 5], 2 GPUs (for DDP), and a batch size of 2. The dataset is an IterableDataset since I am streaming it.

Now I split the dataset using split_dataset_by_node to ensure the data doesn’t get repeated across GPUs. And since it’s already split, I don’t have to use a DistributedSampler, right?
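
For concreteness, here is a minimal sketch of the kind of setup I mean (launched with torchrun --nproc_per_node=2; the toy values [1..5] stand in for my real stream, and the exact per-rank batches may differ depending on how the stream is sharded, but one rank ends up with fewer batches):

```python
import os
from torch.utils.data import DataLoader
from datasets import Dataset
from datasets.distributed import split_dataset_by_node

# torchrun sets RANK and WORLD_SIZE for each process.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Toy stand-in for my real streaming dataset: 5 samples as an IterableDataset.
ds = Dataset.from_dict({"value": [1, 2, 3, 4, 5]}).to_iterable_dataset(num_shards=1)

# Split the stream across ranks so the same sample isn't seen by two GPUs.
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)

# No sampler here: a DistributedSampler can't be used with an IterableDataset anyway.
loader = DataLoader(ds, batch_size=2)

for step, batch in enumerate(loader):
    print(f"rank {rank}, step {step}: {batch['value']}")
```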

But in this case I noticed the following:

First iteration:
first GPU will get → [1, 2]
second GPU will get → [3, 4]

Second iteration:
first GPU will get → [5]
second GPU will get → nothing

This actually creates an issue: with a DistributedSampler, samples are repeated internally to ensure that no GPU is missing data at any iteration, so gradient sync never hangs. So my questions are:

  1. Since the splitting happens beforehand, how do I make sure each GPU gets a batch at every iteration to avoid gradient sync issues? (See the sketch after this list.)
  2. Do we need to use a DistributedSampler? If yes, how?
  3. If dataset.n_shards % world_size != 0, is it possible to shard the streaming dataset on the fly to avoid the case where some ranks are missing data?
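
For question 1, is the intended fix something like DDP’s join() context manager for uneven inputs, instead of padding or repeating samples myself? Roughly like this (just a rough sketch of what I have in mind; the tiny Linear model and the "value" column are placeholders for my real model and data):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from datasets import Dataset
from datasets.distributed import split_dataset_by_node

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", local_rank)

    # Placeholder model just to exercise gradient sync.
    model = DDP(torch.nn.Linear(1, 1).to(device), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    ds = Dataset.from_dict({"value": [1.0, 2.0, 3.0, 4.0, 5.0]}).to_iterable_dataset()
    ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
    loader = DataLoader(ds.with_format("torch"), batch_size=2)

    # join() lets ranks that run out of batches early "shadow" the collectives of
    # the ranks that still have data, so the gradient all-reduce doesn't hang.
    with model.join():
        for batch in loader:
            x = batch["value"].float().unsqueeze(1).to(device)
            loss = model(x).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```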

I have the same question
