Interleaving Iterable Dataset with num_workers > 0

Hello,

I have a couple of iterable datasets I’d like to interleave, but the dataloader raises this error:

NotImplementedError: Sharding a CyclingMultiSourcesExamplesIterable is not implemented

It looks like multiprocessing with interleaved iterable datasets is not yet supported, but having a dataloader with num_workers > 0 is important for my use case.

What would be the easiest way to achieve my goal?

If it’s not possible to create batches that combine data from multiple interleaved datasets with num_workers > 0, I thought of another way to achieve similar results:

I could build a DataLoader that yields batches alternating between the datasets, and then use buffer shuffling to mix those batches together, roughly as in the sketch below.
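
To illustrate, here’s a rough, untested sketch of the idea (the alternating_batches helper and the buffer_size value are placeholders I made up):

import random

from torch.utils.data import DataLoader

from datasets import load_dataset

def alternating_batches(loaders, buffer_size=16):
    # round-robin over the per-dataset DataLoaders, filling a buffer
    # of batches and yielding a random one once the buffer is full
    iterators = [iter(loader) for loader in loaders]
    active = list(range(len(iterators)))
    buffer = []
    while active:
        for i in list(active):
            try:
                buffer.append(next(iterators[i]))
            except StopIteration:
                active.remove(i)
            if len(buffer) >= buffer_size:
                yield buffer.pop(random.randrange(len(buffer)))
    # drain whatever is left once all loaders are exhausted
    while buffer:
        yield buffer.pop(random.randrange(len(buffer)))

dataset = load_dataset("mozilla-foundation/common_voice_11_0", "br", streaming=True)
loader1 = DataLoader(dataset["train"], num_workers=1, batch_size=1)
loader2 = DataLoader(dataset["test"], num_workers=1, batch_size=1)

for batch in alternating_batches([loader1, loader2]):
    print(batch)
    break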

Does the second option seem doable? Is it of interest for this repo? If so, I could open a PR to add it.

Cheers

Simple code to reproduce this issue:

from torch.utils.data import DataLoader

from datasets import load_dataset, interleave_datasets

if __name__ == "__main__":
    dataset = load_dataset("mozilla-foundation/common_voice_11_0", "br", streaming=True)
    dataset1 = dataset["train"]
    dataset2 = dataset["test"]
    
    d3 = interleave_datasets([dataset1, dataset2])
    # iterating over d3 with num_workers > 0 raises the NotImplementedError below
    dataloader = DataLoader(d3, num_workers=1, batch_size=1)
    
    for element in dataloader:
        print(element)

Running it raises:

NotImplementedError: Sharding a CyclingMultiSourcesExamplesIterable is not implemented

Hi !

It’s not implemented because there’s currently no communication between the workers to tell which shards of each subset are being read by which worker. This is needed to stream the whole dataset without duplicates.

Though passing num_workers=1 should actually be allowed - it’s probably worth opening an issue on GitHub about it.

Anyway, you should at least be able to use num_workers=1 with a dataset defined via from_generator, since it has only one shard/data source: its generator.

from datasets import IterableDataset, interleave_datasets, load_dataset

d0 = load_dataset("mozilla-foundation/common_voice_11_0", "br", streaming=True)
d1 = d0["train"]
d2 = d0["test"]
d3 = interleave_datasets([d1, d2])

def generate_data():
    # use .iter() because __iter__ would raise an error in a DataLoader:
    # NotImplementedError: Sharding a CyclingMultiSourcesExamplesIterable is not implemented
    for batch in d3.iter(batch_size=1):
        # unbatch: each batch is a dict of single-element lists
        example = {key: value[0] for key, value in batch.items()}
        yield example

dataset = IterableDataset.from_generator(generate_data)

If you want to use more workers to decode the audio data, you’d need to allow generate_data to take a list of data sources as input, but it requires a bit more work to make sure the audio decoding is distributed correctly across workers. Something like the sketch below might work.
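
A rough, untested sketch of that idea (it assumes that from_generator shards list-valued gen_kwargs across DataLoader workers, and that the streaming splits can be passed to the workers):

from torch.utils.data import DataLoader

from datasets import IterableDataset, load_dataset

d0 = load_dataset("mozilla-foundation/common_voice_11_0", "br", streaming=True)

def generate_data(sources):
    # `sources` is a list, so from_generator can shard it: with
    # num_workers=2, each worker streams and decodes one split
    for source in sources:
        for example in source:
            yield example

dataset = IterableDataset.from_generator(
    generate_data, gen_kwargs={"sources": [d0["train"], d0["test"]]}
)
dataloader = DataLoader(dataset, num_workers=2, batch_size=1)

Note that with this approach the examples are no longer interleaved within a worker: each worker streams a single source, and the DataLoader alternates between workers when it fetches batches.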

Thanks for your answer!

Before reading your answer, I had tried implementing sharding for the merged iterable datasets myself.
I tested my changes with the interleave_datasets and take commands and they seem to work fine (roughly as in the snippet below), but it’s possible I missed something major with the sharding.
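
Simplified, the test looks like this (the take(8) and num_workers=2 values are arbitrary, and it only passes on my patched branch):

from torch.utils.data import DataLoader

from datasets import interleave_datasets, load_dataset

dataset = load_dataset("mozilla-foundation/common_voice_11_0", "br", streaming=True)
d3 = interleave_datasets([dataset["train"], dataset["test"]]).take(8)

# with sharding implemented, num_workers > 1 no longer raises
dataloader = DataLoader(d3, num_workers=2, batch_size=1)
for element in dataloader:
    print(element)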

Can you have a look at my PR when you’ve got the time? I’d like to know whether my solution makes sense before I make all the improvements necessary for it to be merged 🙂

I’ll take a look, thanks!
