I have multiple datasets in multiple languages. I want each batch to come from a single dataset or language, but I am confused about how to do that with the HuggingFace library. It’s similar to interleaving, but interleaving doesn’t seem to handle batches (it only samples one example at a time). I think I should write a custom collator. Can anybody tell me whether I am going the right way?
Hi! Indeed, if you interleave the datasets, you will end up with batches containing examples from different datasets instead of from the same dataset. You may need to implement your own Dataset class for this; see the PyTorch tutorial "Writing Custom Datasets, DataLoaders and Transforms".
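To illustrate the idea of batching first and mixing afterwards, here is a minimal pure-Python sketch. It assumes each dataset is a small in-memory list; the names `chunk` and `interleave_batches` are made up for this example and are not part of any library.

```python
from itertools import zip_longest


def chunk(examples, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


def interleave_batches(datasets, batch_size):
    """Yield (name, batch) pairs, alternating between datasets batch by batch.

    Hypothetical helper: `datasets` maps a dataset name to a list of examples.
    Every yielded batch contains examples from exactly one dataset.
    """
    per_dataset = [
        [(name, batch) for batch in chunk(examples, batch_size)]
        for name, examples in datasets.items()
    ]
    # zip_longest pads the shorter datasets with None once they run out.
    for group in zip_longest(*per_dataset):
        for item in group:
            if item is not None:
                yield item


datasets = {
    "en": ["hello", "world", "good", "morning"],
    "fr": ["bonjour", "monde"],
}
batches = list(interleave_batches(datasets, batch_size=2))
```

Because the examples are grouped into batches before the datasets are mixed, a batch can never contain both languages. For datasets that do not fit in memory, the same logic would have to work on indices or iterators instead of lists.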
@manirai91 Were you able to find a solution for this?
I kinda want to do the same thing, but I don’t know how to implement it.
Basically, I have two datasets of English and French sentences, and during training I want to randomly select each batch from one of the datasets, so that each batch contains only English sentences or only French sentences.
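One possible way to get this behaviour is a custom batch sampler rather than a custom collator. The sketch below is an assumption about how it could be done, not an official recipe: `SingleSourceBatchSampler` is a hypothetical class that only produces lists of indices, where the indices of the second dataset start right after those of the first (the same layout `torch.utils.data.ConcatDataset` uses).

```python
import random


class SingleSourceBatchSampler:
    """Yield batches of indices so that each batch comes from one dataset.

    Hypothetical helper: `sizes` lists the length of each dataset, in the
    order in which the datasets are concatenated.
    """

    def __init__(self, sizes, batch_size, seed=0):
        self.sizes = sizes
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def __iter__(self):
        batches = []
        offset = 0
        for size in self.sizes:
            # Global indices belonging to this dataset only.
            indices = list(range(offset, offset + size))
            self.rng.shuffle(indices)
            batches.extend(
                indices[i:i + self.batch_size]
                for i in range(0, size, self.batch_size)
            )
            offset += size
        # Shuffle whole batches, so the source dataset varies randomly
        # from step to step while each batch stays single-source.
        self.rng.shuffle(batches)
        return iter(batches)

    def __len__(self):
        # Number of batches per epoch (ceiling division per dataset).
        return sum(-(-size // self.batch_size) for size in self.sizes)


# Example: 5 English and 3 French examples, batches of 2.
sampler = SingleSourceBatchSampler([5, 3], batch_size=2)
index_batches = list(sampler)
```

With PyTorch, such an object can be passed as `batch_sampler=` to a `DataLoader` built over a `ConcatDataset` of the English and French datasets, since `batch_sampler` accepts any iterable of index lists. Larger datasets contribute more batches here, so they are picked more often; sampling each language with equal probability would need a different selection step.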