Yielding items from multiple datasets in parallel

Hi,

I have several datasets and want a dataloader that samples from all of them, so that each iteration over the dataloader yields batch_size items from each dataset.

Is that possible?

Hi! You can use interleave_datasets for that and pass the returned dataset to the dataloader. Another option is to create one dataloader for each dataset and sample from them.

Aah, I think interleave_datasets will yield batch_size items overall, from a mixture of datasets, whereas I want batch_size items from each dataset. Is that possible?

When no sampling probabilities are passed, interleave_datasets cycles through the given list of datasets in order. This means you can set the dataloader’s batch size to batch_size * the number of interleaved datasets to get batch_size samples from each dataset in each iteration, as long as the dataloader does not shuffle the interleaved dataset. Another option is to have a separate dataloader for each dataset.