Loading community JSON-based datasets without a script

vblagoje · September 30, 2021, 7:38am

Hi,

I can easily create and upload community JSON datasets to the HF Hub without a script. However, when using the dataset I need to specify the data_files parameter, i.e. dataset = load_dataset("vblagoje/dataset_abc", data_files={"train": "train.json", "validation": "validation.json", "test": "test.json"}), which is a bit inconvenient.

Is there a way to avoid the data_files parameter short of providing a loading script as well? If not, why don’t we rely on naming conventions to resolve the data splits?

Cheers,
Vladimir

julien-c · September 30, 2021, 11:19am

Tagging @lhoestq @albertvillanova in case you hadn’t seen this!

lhoestq · October 4, 2021, 9:17am

Hi! I like the idea of using a naming convention. By default it could use something like:

data_files = {
    "train": ["*train*"],
    "test": ["*test*"],
    "validation": ["*dev*", "*valid*"]
}
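
To make the idea concrete, here is a minimal sketch of how such default patterns could be matched against the file names in a dataset repository. It is only an illustration under stated assumptions: the `resolve_data_files` helper and the example file list are hypothetical, not part of the datasets library, and the matching uses Python's standard `fnmatch` module.

```python
from fnmatch import fnmatchcase

# Default patterns proposed in this thread.
DEFAULT_PATTERNS = {
    "train": ["*train*"],
    "test": ["*test*"],
    "validation": ["*dev*", "*valid*"],
}


def resolve_data_files(filenames, patterns=DEFAULT_PATTERNS):
    """Hypothetical helper: map each split to the files matching any of its patterns."""
    resolved = {}
    for split, globs in patterns.items():
        matches = [
            name for name in filenames
            if any(fnmatchcase(name, glob) for glob in globs)
        ]
        if matches:  # leave out splits that have no matching file
            resolved[split] = matches
    return resolved


# Example file listing (assumed for illustration).
files = ["train.json", "validation.json", "test.json", "README.md"]
data_files = resolve_data_files(files)
```

One thing to keep in mind with patterns this broad is that they match substrings anywhere in the name, so a file called latest.json would also be picked up by "*test*".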

Another alternative would be to let the user specify the data_files patterns in an additional JSON file (config.json or something like that). Let me know what you think!

vblagoje · October 4, 2021, 10:45am

Yes, that’s the perfect default, @lhoestq. Perhaps it could be expanded later, but for now this solution suffices, IMHO.
