Hi there!
I am currently working on uploading a new data resource with 24 languages as subsets, each with its own JSONL-formatted data splits (train/validation/test).
Given that I had a great experience with the default data loader on previous datasets, I wanted to rely on it again here. However, despite my attempts at structuring the data/ folder by language, the language folders are not automatically recognized as separate subsets.
Quite the opposite: the data from all folders is mixed into one giant train/validation/test portion.
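To make the layout concrete, here is a minimal sketch of the structure I mean and the per-language grouping I would like to end up with. The language codes, the file names (train.jsonl etc.), and both helper functions are hypothetical placeholders for illustration, not my actual data and not part of any library; the sketch only uses the Python standard library.

```python
import json
import tempfile
from pathlib import Path

# Split names as described above; the "<split>.jsonl" file naming is an assumption.
SPLITS = ("train", "validation", "test")


def build_example_layout(root: Path, languages: list[str]) -> None:
    """Create data/<lang>/<split>.jsonl with one dummy record per file."""
    for lang in languages:
        lang_dir = root / "data" / lang
        lang_dir.mkdir(parents=True, exist_ok=True)
        for split in SPLITS:
            record = {"language": lang, "split": split, "text": "placeholder"}
            (lang_dir / f"{split}.jsonl").write_text(
                json.dumps(record) + "\n", encoding="utf-8"
            )


def group_splits_by_language(root: Path) -> dict[str, dict[str, str]]:
    """Map each language folder to its split files (paths relative to root)."""
    subsets: dict[str, dict[str, str]] = {}
    for lang_dir in sorted((root / "data").iterdir()):
        if not lang_dir.is_dir():
            continue
        subsets[lang_dir.name] = {
            split: (lang_dir / f"{split}.jsonl").relative_to(root).as_posix()
            for split in SPLITS
            if (lang_dir / f"{split}.jsonl").exists()
        }
    return subsets


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Three hypothetical language codes stand in for the 24 real ones.
        build_example_layout(root, ["de", "en", "fr"])
        for lang, files in group_splits_by_language(root).items():
            print(lang, files)
```

In other words, I would like each language folder to show up as its own subset with its own three splits, rather than everything being pooled together.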
I have looked at how other multilingual datasets structure their data, and all of the ones I checked seem to provide their own custom loader scripts.
My question is whether there is a way to make the default loader accept this folder structure and provide access to dedicated per-language subsets, or whether I have to rely on my own custom script in such a case.
Many thanks in advance for any pointers!
Best,
Dennis