I have multiple datasets, one per language. I want each batch to come from a single dataset (language), but I am not sure how to do that with the HuggingFace library. It is similar to interleaving, but interleaving doesn't seem to handle batches (it only samples one example at a time). I think I should write a custom collator. Can anybody tell me whether I am on the right track?
Hi! Indeed, if you interleave the datasets, you will end up with batches containing examples from different datasets rather than from a single one. You may need to implement your own Dataset class for this; see the PyTorch tutorial "Writing Custom Datasets, DataLoaders and Transforms".
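One way to get single-dataset batches without touching the collator is a batch sampler that only ever groups indices from one dataset. The sketch below is a rough illustration, not an official HuggingFace or PyTorch recipe: `SingleSourceBatchSampler` and its parameters are hypothetical names, and it assumes the datasets are map-style and concatenated in a fixed order (for example with `torch.utils.data.ConcatDataset`), so that global indices line up.

```python
import random
from typing import Iterator, List, Sequence


class SingleSourceBatchSampler:
    """Yield batches of global indices, each drawn from exactly one dataset.

    Indices refer to the concatenation of the datasets in the given order,
    which matches how torch.utils.data.ConcatDataset indexes its members.
    (Hypothetical helper, written for illustration.)
    """

    def __init__(self, lengths: Sequence[int], batch_size: int,
                 shuffle: bool = True, seed: int = 0) -> None:
        self.lengths = list(lengths)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __iter__(self) -> Iterator[List[int]]:
        batches: List[List[int]] = []
        offset = 0
        for n in self.lengths:
            indices = list(range(offset, offset + n))
            if self.shuffle:
                self.rng.shuffle(indices)  # shuffle examples within one dataset
            # Chunk inside one dataset so a batch never crosses a boundary.
            for start in range(0, n, self.batch_size):
                batches.append(indices[start:start + self.batch_size])
            offset += n
        if self.shuffle:
            self.rng.shuffle(batches)  # mix which dataset each step draws from
        return iter(batches)

    def __len__(self) -> int:
        # Ceil-divide per dataset; the last batch of each may be smaller.
        return sum(-(-n // self.batch_size) for n in self.lengths)


# Assumed usage with PyTorch (ds_en and ds_fr are placeholder datasets):
#   from torch.utils.data import ConcatDataset, DataLoader
#   sampler = SingleSourceBatchSampler([len(ds_en), len(ds_fr)], batch_size=32)
#   loader = DataLoader(ConcatDataset([ds_en, ds_fr]), batch_sampler=sampler)
```

The sampler is a plain iterable of index lists, which is what `DataLoader`'s `batch_sampler` argument expects. The last batch of each dataset can be smaller than `batch_size`; drop it inside `__iter__` if your training loop needs fixed-size batches.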