Dataset subsets with default Dataloader

Hi there!

I am currently working on uploading a new data resource with 24 languages as subsets, each with its own JSONL-formatted data splits (train/validation/test).
Given that I had a great experience with the default data loader in previous datasets, I wanted to rely on it again here; however, despite my attempts at structuring the data/ folder by language, the language folders are not automatically recognized as separate subsets. Quite the opposite: the data from all folders is merged into one giant train/validation/test set.

I have tried to look at how other multilingual datasets structure their data, and it seems all of them provide their custom loader scripts.
My question is now whether there is a way to make the default loader accept the folder structure and provide access to dedicated subsets. Or do I have to rely on my own custom script in such a case?

Many thanks in advance for any pointers!

Hi! Yes, a custom script is needed at the moment to define multiple subsets/configs. The GitHub issue "Add support to create different configs with `push_to_hub` (+ inferring configs from directories with package managers?)" (Issue #5151, huggingface/datasets) should cover the automatic inference of configs, so feel free to comment on this issue if you have suggestions and/or subscribe to track progress.
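
In the meantime, if you only need per-language access on the loading side (rather than named configs on the Hub), one possible workaround is to point the generic JSON loader at a single language folder yourself. Below is a minimal sketch, not an official recipe. It assumes a layout of `data/<language>/<split>.jsonl`; the file names and the helpers `language_data_files` and `load_language` are hypothetical and only for illustration. The only library call used is `load_dataset("json", data_files=...)`.

```python
from pathlib import Path

# Assumed split names, matching the train/validation/test splits from the question.
SPLITS = ("train", "validation", "test")


def language_data_files(data_root, language):
    """Map each split name to the JSONL file found under data_root/<language>/.

    Assumes (hypothetically) that files are named <split>.jsonl.
    """
    lang_dir = Path(data_root) / language
    files = {}
    for split in SPLITS:
        path = lang_dir / f"{split}.jsonl"
        if path.is_file():
            files[split] = str(path)
    if not files:
        raise FileNotFoundError(f"no split files found in {lang_dir}")
    return files


def load_language(data_root, language):
    """Load one language as its own DatasetDict via the generic JSON loader."""
    # Imported lazily so the path helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("json", data_files=language_data_files(data_root, language))
```

With a local copy of the data, something like `load_language("data", "en")` would then give you a train/validation/test `DatasetDict` for that language only. This does not create real subsets on the Hub, so anyone using the dataset would need the same helper; a custom loading script is still the way to expose proper configs for now.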


Alright, thanks for the fast reply! Glad to see the issue, I’ll stop by to check what the current proposal looks like!