Correct way to use multiple workers with interleave_datasets for iterable datasets

I would like to use multiple workers and interleave datasets with iterable datasets.

What is the best way to do this? I gather from the PR "Implement sharding on merged iterable datasets by Hubert-Bonisseur · Pull Request #5735 · huggingface/datasets · GitHub" that the individual datasets need to be sharded. How is this achieved when using load_dataset?

Sharding depends on the dataset; one shard corresponds to one file.

Sharding is quite important to enable the use of num_workers in data loaders, so feel free to use datasets that are already sharded, or shard a dataset yourself, e.g. using:

from datasets import load_dataset

ds = load_dataset(...)
# num_shards controls how many files the dataset is split into on the Hub
ds.push_to_hub(repo_id, num_shards=...)
