Distributed data sampling for streaming

I'm reading data as a stream, and I need to pass it to a pipeline that is run in a distributed manner, where each process is expected to handle a different batch of data.

When I tried the following:

from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset(
    "oscar-corpus/OSCAR-2301",
    token=token,
    language="ar",
    streaming=True,
    split="train",
)

dataloader = iter(DataLoader(dataset, num_workers=5, batch_size=1000,
                             collate_fn=lambda x: list(x)))

run_pipes(
    inputs=dataloader,  # any inputs of type Iterable
)

it didn't work: the dataloader was replicated across processes, and the processes ended up with the same batches of data.

Hi! You should be able to avoid this data duplication by using split_dataset_by_node, as explained in IterableDataset returns duplicated data using PyTorch DDP · Issue #5360 · huggingface/datasets · GitHub.
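
For reference, a minimal sketch of what that looks like here, assuming the rank and world size come from the launcher's environment variables (e.g. the RANK and WORLD_SIZE that torchrun sets) and that token is your Hugging Face token as in your snippet:

import os
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

# Rank of this process and total number of processes,
# as set by torchrun / your distributed launcher.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

dataset = load_dataset(
    "oscar-corpus/OSCAR-2301",
    token=token,  # your Hugging Face token, as in the original snippet
    language="ar",
    streaming=True,
    split="train",
)

# Give each process a disjoint shard of the stream.
dataset = split_dataset_by_node(dataset, rank=rank, world_size=world_size)

dataloader = iter(DataLoader(dataset, num_workers=5, batch_size=1000,
                             collate_fn=lambda x: list(x)))

If the dataset's number of shards (dataset.n_shards) is divisible by world_size, each node streams its own subset of shards; otherwise each node keeps one example out of every world_size examples and skips the rest, so every rank still sees distinct data.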

Thank you, that solved the issue