Dataset subsets with default Dataloader

dennlinger · October 24, 2022, 2:44pm

Hi there!

I am currently working on uploading a new data resource, for which I have 24 languages as subsets, each with its own JSONL-formatted data splits (train/validation/test).
Given that I had a great experience with the default data loader in previous datasets, I wanted to rely on it again here. However, despite my attempts at structuring the data/ folder by language, the language folders are not automatically recognized as separate subsets. Quite the opposite: the data from all folders is mixed into one giant train/validation/test portion.

I have looked at how other multilingual datasets structure their data, and it seems all of them provide their own custom loader scripts.
My question is whether there is a way to make the default loader accept the folder structure and provide access to dedicated subsets. Or do I have to rely on my own custom script in such a case?

Many thanks in advance for any pointers!
Best,
Dennis

mariosasko · October 24, 2022, 3:49pm

Hi! Yes, a custom script is needed at the moment to define multiple subsets/configs. The GitHub issue "Add support to create different configs with `push_to_hub` (+ inferring configs from directories with package managers?)" (Issue #5151, huggingface/datasets) should implement the automatic inference of configs, so feel free to comment on it if you have suggestions and/or subscribe to track progress.
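For anyone landing here in the meantime, one stopgap that avoids a loader script is to point the generic `json` builder at a single language folder via explicit `data_files`. This is only a rough sketch, not the approach proposed in the issue. The `data/<language>/<split>.jsonl` layout and the helper names `build_data_files` and `load_language_subset` are assumptions made for illustration; adjust them to the actual file names.

```python
from pathlib import Path


def build_data_files(data_root, language):
    """Map split names to the JSONL files found under data_root/<language>/.

    Assumes files are named train.jsonl, validation.jsonl and test.jsonl
    (a hypothetical layout; the thread does not state the file names).
    """
    lang_dir = Path(data_root) / language
    data_files = {}
    for split in ("train", "validation", "test"):
        path = lang_dir / f"{split}.jsonl"
        if path.exists():
            data_files[split] = str(path)
    if not data_files:
        raise FileNotFoundError(f"No split files found in {lang_dir}")
    return data_files


def load_language_subset(data_root, language):
    """Load one language folder as its own set of splits."""
    # Imported lazily so the path helper above works without `datasets` installed.
    from datasets import load_dataset

    # The generic "json" builder also reads JSON Lines files.
    return load_dataset("json", data_files=build_data_files(data_root, language))
```

Calling `load_language_subset("data", "de")` would then return only the German splits instead of the mixed data. This does not make the subsets show up as configs of the dataset; it only keeps the languages apart at load time.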

dennlinger · October 25, 2022, 12:00pm

Alright, thanks for the fast reply! Glad to see the issue; I’ll stop by to check what the current proposal looks like!
