Hi!
It’s not implemented because there’s currently no communication between the workers to tell which shards from one subset or the other are being read by each worker. This is needed to be able to stream the full dataset exactly once, without duplicates.
Passing num_workers=1 should actually be allowed though - it’s probably worth opening an issue on GitHub about it.
Anyway, you should at least be able to use num_workers=1 with a dataset defined with from_generator, since it has only 1 shard/data_source - its generator:
from datasets import IterableDataset, interleave_datasets, load_dataset

d0 = load_dataset("mozilla-foundation/common_voice_11_0", "br", streaming=True)
d1 = d0["train"]
d2 = d0["test"]
d3 = interleave_datasets([d1, d2])

def generate_data():
    # use .iter() because __iter__ would raise an error in a DataLoader:
    # NotImplementedError: Sharding a CyclingMultiSourcesExamplesIterable is not implemented
    for batch in d3.iter(batch_size=1):
        example = {key: value[0] for key, value in batch.items()}
        yield example

dataset = IterableDataset.from_generator(generate_data)
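You can then iterate over it with a single DataLoader worker, e.g. (a minimal sketch assuming a PyTorch DataLoader; batch_size=None just disables the default collation so the examples are yielded as-is):

from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, num_workers=1, batch_size=None)
for example in dataloader:
    ...  # each example is a dict of the dataset's columns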
If you want to use more workers to decode the audio data, you’d need to allow generate_data to take a list of data sources as input, but that requires a bit more work to make sure the audio decoding is distributed correctly across workers.
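In case it helps, here is a rough sketch of the shape that could take, assuming the sources are passed to IterableDataset.from_generator via gen_kwargs (passing a list there is what defines the shards, one per element). The "sources" argument name is just illustrative, and note that this assigns whole splits to workers instead of interleaving them - handling the interleaving and audio decoding per worker is the extra work mentioned above:

from datasets import IterableDataset, load_dataset

def generate_data(sources):
    # each worker receives a subset of `sources` and decodes its audio independently
    for source in sources:
        for batch in source.iter(batch_size=1):
            yield {key: value[0] for key, value in batch.items()}

d0 = load_dataset("mozilla-foundation/common_voice_11_0", "br", streaming=True)
dataset = IterableDataset.from_generator(
    generate_data,
    gen_kwargs={"sources": [d0["train"], d0["test"]]},  # one shard per source
)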