Yielding items from multiple datasets in parallel

Hi,

I have several datasets and want a dataloader that can sample from all of them, so that iterating over the dataloader yields batch_size items from each dataset.

Is that possible?

Hi! You can use interleave_datasets for that and pass the returned dataset to the dataloader. Another option is to create one dataloader for each dataset and sample from them.
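
For example, here's a minimal sketch of the first option, assuming two toy datasets built with `Dataset.from_dict` (all names here are illustrative):

```python
from datasets import Dataset, interleave_datasets
from torch.utils.data import DataLoader

# Two toy datasets standing in for the real ones.
ds_a = Dataset.from_dict({"x": list(range(10))})
ds_b = Dataset.from_dict({"x": list(range(100, 110))})

# interleave_datasets alternates examples: a, b, a, b, ...
mixed = interleave_datasets([ds_a, ds_b]).with_format("torch")

loader = DataLoader(mixed, batch_size=4)
for batch in loader:
    print(batch["x"])  # e.g. tensor([  0, 100,   1, 101])
    break
```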

Aah, I think interleave_datasets will yield batch_size items overall, from a mixture of datasets, whereas I want batch_size items from each dataset. Is that possible?

interleave_datasets cycles through the given list of datasets, which means you can set the dataloader’s batch size to batch_size * the number of interleaved datasets to get batch_size samples from each dataset in each iteration. Another option is to have a separate dataloader for each dataset.
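
To illustrate the batch-size trick, here's a sketch assuming two interleaved datasets and no shuffling (shuffling would break the per-dataset alternation within a batch; names are illustrative):

```python
from datasets import Dataset, interleave_datasets
from torch.utils.data import DataLoader

batch_size = 4
parts = [
    Dataset.from_dict({"x": list(range(10))}),
    Dataset.from_dict({"x": list(range(100, 110))}),
]
mixed = interleave_datasets(parts).with_format("torch")

# Multiply the batch size by the number of datasets so every batch
# contains batch_size examples from each dataset, interleaved.
loader = DataLoader(mixed, batch_size=batch_size * len(parts))
for batch in loader:
    print(batch["x"])  # tensor([  0, 100,   1, 101,   2, 102,   3, 103])
    break
```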

If we use a separate dataloader for each dataset, what would the training loop look like?
In each epoch, how can we get a batch from each of the dataloaders and calculate the loss?
Especially in the case where the dataloaders don't all have the same length?
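
For reference, one common pattern for this (not confirmed by this thread; all names are illustrative) is to wrap the shorter dataloader in `itertools.cycle` so that `zip` keeps yielding one batch from every dataset until the longest loader is exhausted:

```python
import itertools
import torch
from torch.utils.data import DataLoader, TensorDataset

# Two illustrative dataloaders with different lengths.
loader_a = DataLoader(TensorDataset(torch.arange(12.0)), batch_size=4)  # 3 batches
loader_b = DataLoader(TensorDataset(torch.arange(40.0)), batch_size=4)  # 10 batches

for epoch in range(2):
    # Cycle the shorter loader so zip runs until the longer one is done;
    # each step yields one batch from every dataset.
    for (xa,), (xb,) in zip(itertools.cycle(loader_a), loader_b):
        loss = xa.mean() + xb.mean()  # placeholder per-dataset losses
        # loss.backward(); optimizer.step(); optimizer.zero_grad()
```

One caveat: `itertools.cycle` caches the first pass, so a shuffled dataloader will replay the same batch order on later cycles; re-creating the iterator whenever it runs out avoids that.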