Hi!
It’s not implemented because there’s currently no communication between the workers to tell which shards from one subset or the other are being read by each worker. This is needed to be able to stream the full dataset exactly once, without duplicates.
Passing num_workers=1 should actually be allowed though - it’s probably worth opening an issue on GitHub about it.
Anyway, you should at least be able to use num_workers=1 with a dataset defined with from_generator, since it has only 1 shard/data_source - its generator:
from datasets import IterableDataset, interleave_datasets, load_dataset

d0 = load_dataset("mozilla-foundation/common_voice_11_0", "br", streaming=True)
d1 = d0["train"]
d2 = d0["test"]
d3 = interleave_datasets([d1, d2])

def generate_data():
    # use .iter() because __iter__ would raise an error in a DataLoader:
    # NotImplementedError: Sharding a CyclingMultiSourcesExamplesIterable is not implemented
    for batch in d3.iter(batch_size=1):
        example = {key: value[0] for key, value in batch.items()}
        yield example

dataset = IterableDataset.from_generator(generate_data)
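You can then iterate over it with a single DataLoader worker, e.g. (a minimal sketch assuming a PyTorch DataLoader; batch_size=None just disables the default collation so the examples are yielded as-is):

from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, num_workers=1, batch_size=None)
for example in dataloader:
    ...  # each example is a dict of the dataset's columns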
If you want to use more workers to decode the audio data, you’d need to allow generate_data to take a list of data sources as input, but that requires a bit more work to make sure the audio decoding is distributed correctly across workers.
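In case it helps, here is a rough sketch of the shape that could take, assuming the sources are passed to IterableDataset.from_generator via gen_kwargs (passing a list there is what defines the shards, one per element). The "sources" argument name is just illustrative, and note that this assigns whole splits to workers instead of interleaving them - handling the interleaving and audio decoding per worker is the extra work mentioned above:

from datasets import IterableDataset, load_dataset

def generate_data(sources):
    # each worker receives a subset of `sources` and decodes its audio independently
    for source in sources:
        for batch in source.iter(batch_size=1):
            yield {key: value[0] for key, value in batch.items()}

d0 = load_dataset("mozilla-foundation/common_voice_11_0", "br", streaming=True)
dataset = IterableDataset.from_generator(
    generate_data,
    gen_kwargs={"sources": [d0["train"], d0["test"]]},  # one shard per source
)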