Yielding items from multiple datasets in parallel

Hi,

I have several datasets and want a dataloader that can sample from all of them, so that iterating over the dataloader yields batch_size items from each dataset.

Is that possible?

Hi! You can use interleave_datasets for that and pass the returned dataset to the dataloader. Another option is to create one dataloader for each dataset and sample from them.
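
For example, here's a minimal sketch of the first option, assuming two toy datasets built with `Dataset.from_dict` (all names here are illustrative):

```python
from datasets import Dataset, interleave_datasets
from torch.utils.data import DataLoader

# Two toy datasets standing in for the real ones.
ds_a = Dataset.from_dict({"x": list(range(10))})
ds_b = Dataset.from_dict({"x": list(range(100, 110))})

# interleave_datasets alternates examples: a, b, a, b, ...
mixed = interleave_datasets([ds_a, ds_b]).with_format("torch")

loader = DataLoader(mixed, batch_size=4)
for batch in loader:
    print(batch["x"])  # e.g. tensor([  0, 100,   1, 101])
    break
```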

Aah, I think interleave_datasets will yield batch_size items overall, from a mixture of datasets, whereas I want batch_size items from each dataset. Is that possible?

interleave_datasets cycles through the given list of datasets, which means you can set the dataloader’s batch size to batch_size * the number of interleaved datasets to get batch_size samples from each dataset in each iteration. Another option is to have a separate dataloader for each dataset.
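
To illustrate the batch-size trick, here's a sketch assuming two interleaved datasets and no shuffling (shuffling would break the per-dataset alternation within a batch; names are illustrative):

```python
from datasets import Dataset, interleave_datasets
from torch.utils.data import DataLoader

batch_size = 4
parts = [
    Dataset.from_dict({"x": list(range(10))}),
    Dataset.from_dict({"x": list(range(100, 110))}),
]
mixed = interleave_datasets(parts).with_format("torch")

# Multiply the batch size by the number of datasets so every batch
# contains batch_size examples from each dataset, interleaved.
loader = DataLoader(mixed, batch_size=batch_size * len(parts))
for batch in loader:
    print(batch["x"])  # tensor([  0, 100,   1, 101,   2, 102,   3, 103])
    break
```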

If we use a separate dataloader for each dataset, what would the training loop look like?
In each epoch, how can we get a batch from each of the dataloaders and calculate the loss?
Especially in the case where the dataloaders don't all have the same length?
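
For reference, one common pattern for this (not confirmed by this thread; all names are illustrative) is to wrap the shorter dataloader in `itertools.cycle` so that `zip` keeps yielding one batch from every dataset until the longest loader is exhausted:

```python
import itertools
import torch
from torch.utils.data import DataLoader, TensorDataset

# Two illustrative dataloaders with different lengths.
loader_a = DataLoader(TensorDataset(torch.arange(12.0)), batch_size=4)  # 3 batches
loader_b = DataLoader(TensorDataset(torch.arange(40.0)), batch_size=4)  # 10 batches

for epoch in range(2):
    # Cycle the shorter loader so zip runs until the longer one is done;
    # each step yields one batch from every dataset.
    for (xa,), (xb,) in zip(itertools.cycle(loader_a), loader_b):
        loss = xa.mean() + xb.mean()  # placeholder per-dataset losses
        # loss.backward(); optimizer.step(); optimizer.zero_grad()
```

One caveat: `itertools.cycle` caches the first pass, so a shuffled dataloader will replay the same batch order on later cycles; re-creating the iterator whenever it runs out avoids that.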