Creating sharded IterableDataset from a list of IterableDatasets?

zhh210 · June 30, 2024, 4:55am

I had the exact same problem. HF's datasets.interleave_datasets() accepts a list of IterableDatasets, but the n_shards of the returned IterableDataset is the smallest n_shards among the inputs. In your case that is 1. The workaround is to pre-process the inputs so that every IterableDataset in the list has an n_shards of n before you pass the list to interleave_datasets().
