How to sample batches from multiple datasets?

manirai91 · June 24, 2022, 3:45pm

I have multiple datasets in multiple languages. I want each batch to come from a single dataset (i.e. a single language), but I am confused about how to do that with the Hugging Face libraries. It’s similar to interleaving, but interleaving doesn’t seem to handle batches (it only samples one example at a time). I think I should write a custom collator. Can anybody tell me whether I am going the right way?

lhoestq · June 27, 2022, 2:15pm

Hi! Indeed, if you interleave the datasets, you will end up with batches containing examples from different datasets rather than from a single dataset. You may need to implement your own Dataset class for this; see the PyTorch tutorial "Writing Custom Datasets, DataLoaders and Transforms".
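
One possible way to get single-dataset batches without writing a full Dataset class is a custom batch sampler over the concatenation of the datasets. The sketch below is only an illustration under stated assumptions: `SingleSourceBatchSampler` is a hypothetical name, not part of any library, and it assumes the datasets are concatenated in the same order as `dataset_sizes` so that global indices line up.

```python
import random


class SingleSourceBatchSampler:
    """Yield batches of global indices where every index in a batch
    belongs to the same underlying dataset.

    Hypothetical helper: indices refer to positions in the concatenation
    of the datasets, in the order given by `dataset_sizes`.
    """

    def __init__(self, dataset_sizes, batch_size, shuffle=True,
                 drop_last=False, seed=None):
        self.dataset_sizes = list(dataset_sizes)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.drop_last = drop_last
        self.rng = random.Random(seed)

    def _batches_for(self, offset, size):
        # Global indices covered by one dataset: [offset, offset + size).
        indices = list(range(offset, offset + size))
        if self.shuffle:
            self.rng.shuffle(indices)
        batches = [indices[i:i + self.batch_size]
                   for i in range(0, size, self.batch_size)]
        if self.drop_last and batches and len(batches[-1]) < self.batch_size:
            batches.pop()
        return batches

    def __iter__(self):
        all_batches = []
        offset = 0
        for size in self.dataset_sizes:
            all_batches.extend(self._batches_for(offset, size))
            offset += size
        if self.shuffle:
            # Mix the order of batches across datasets; each batch itself
            # still comes from exactly one dataset.
            self.rng.shuffle(all_batches)
        return iter(all_batches)

    def __len__(self):
        total = 0
        for size in self.dataset_sizes:
            if self.drop_last:
                total += size // self.batch_size
            else:
                total += (size + self.batch_size - 1) // self.batch_size
        return total
```

With PyTorch, an object like this could presumably be passed as `batch_sampler=` to `torch.utils.data.DataLoader` together with a `ConcatDataset` (or a dataset built with `datasets.concatenate_datasets`), since `batch_sampler` accepts any iterable that yields lists of indices. A collator would still be needed for padding, but it would not have to decide which dataset a batch comes from.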

tehranixyz · January 18, 2024, 5:59pm

@manirai91 Were you able to find a solution for this?

I kinda want to do the same thing, but I don’t know how to implement it.
Basically, I have two datasets of English and French sentences, and while training, I want to randomly select a batch from one of the datasets. So, during training, each batch would contain either only English sentences or only French sentences.
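
To make the idea concrete, here is a rough sketch of what I mean, assuming each language already has its own iterable of batches (for example one DataLoader per language). `random_batch_stream` is just a name made up for illustration, not something from a library, and I am not sure it is the best approach.

```python
import random


def random_batch_stream(batch_iterables, weights=None, seed=None):
    """Yield (source_name, batch) pairs, picking a source at random at each
    step until every source is exhausted.

    batch_iterables: dict mapping a name (e.g. "en", "fr") to an iterable
                     of batches; each batch comes from that source only.
    weights:         optional dict of positive sampling weights per name.
    """
    rng = random.Random(seed)
    iterators = {name: iter(it) for name, it in batch_iterables.items()}
    weights = dict(weights) if weights else {name: 1.0 for name in iterators}

    while iterators:
        names = list(iterators)
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            batch = next(iterators[name])
        except StopIteration:
            # This source has no batches left; stop sampling from it.
            del iterators[name]
            continue
        yield name, batch


# Toy usage with plain lists standing in for per-language data loaders.
english_batches = [["hello", "world"], ["good", "morning"]]
french_batches = [["bonjour", "monde"]]
for lang, batch in random_batch_stream(
        {"en": english_batches, "fr": french_batches}, seed=0):
    print(lang, batch)
```

Each yielded batch is monolingual by construction, and the order in which languages appear is random. The `weights` argument is there in case one language should be sampled more often than the other.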
