Correct way to use multiple workers with interleave_datasets for iterable datasets

I would like to use multiple workers and interleave datasets with iterable datasets.

What is the best way to do this? I gather from the PR "Implement sharding on merged iterable datasets by Hubert-Bonisseur · Pull Request #5735 · huggingface/datasets · GitHub" that the individual datasets need to be sharded. How is this achieved when using load_dataset?

Sharding depends on the dataset; one shard corresponds to one file.

Sharding is quite important to enable the use of num_workers in data loaders, so feel free to use datasets that are already sharded, or shard a dataset yourself, e.g. using:

from datasets import load_dataset

ds = load_dataset(...)
# num_shards controls how many files the dataset is split into on the Hub
ds.push_to_hub(repo_id, num_shards=...)
