Loading community JSON-based datasets without a script

Hi,

I can easily create and upload JSON community datasets to the HF Hub (without a script). However, when using the dataset I need to specify the data_files parameter, i.e. dataset = load_dataset("vblagoje/dataset_abc", data_files={"train": "train.json", "validation": "validation.json", "test": "test.json"}), which is a bit of an inconvenience.

Is there a way to avoid the data_files parameter short of providing a loading script as well? If not, why don’t we rely on naming conventions to resolve the data splits anyway?

Cheers,
Vladimir

Tagging @lhoestq @albertvillanova in case you hadn’t seen this!

Hi! I like the idea of using a naming convention. By default it could use something like:

data_files = {
    "train": ["*train*"],
    "test": ["*test*"],
    "validation": ["*dev*", "*valid*"]
}
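
To illustrate how such a default could resolve splits, here is a minimal sketch using only the standard library. The helper name resolve_data_files and the example file list are hypothetical, and this is not how the datasets library actually implements it; it only shows the glob-matching idea.

```python
from fnmatch import fnmatchcase

# Default patterns from the proposal above.
DEFAULT_PATTERNS = {
    "train": ["*train*"],
    "test": ["*test*"],
    "validation": ["*dev*", "*valid*"],
}


def resolve_data_files(filenames, patterns=DEFAULT_PATTERNS):
    """Group file names by split using glob-style patterns (hypothetical helper)."""
    data_files = {}
    for split, globs in patterns.items():
        # Lowercase the name so that matching does not depend on case.
        matched = sorted(
            name
            for name in filenames
            if any(fnmatchcase(name.lower(), glob) for glob in globs)
        )
        if matched:
            data_files[split] = matched
    return data_files


# Hypothetical listing of the files in a dataset repository.
repo_files = ["README.md", "train.json", "validation.json", "test.json"]
data_files = resolve_data_files(repo_files)
print(data_files)
```

One caveat with patterns this loose: a file called latest.json would also match "*test*", so a real implementation would probably need stricter patterns or a way to override them.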

Another alternative would be to let the user specify the data_files patterns in an additional JSON file (config.json or something like this). Let me know what you think!
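
For the config-file alternative, a rough sketch of what reading the patterns could look like. The "data_files" key, the file layout, and the helper data_files_from_config are all assumptions made for illustration; no such format has been agreed on here.

```python
import json

# Hypothetical contents of a config.json stored next to the data files.
config_text = """
{
  "data_files": {
    "train": ["train.json"],
    "validation": ["validation.json"],
    "test": ["test.json"]
  }
}
"""


def data_files_from_config(text):
    """Parse a config string and return its split-to-patterns mapping (hypothetical helper)."""
    config = json.loads(text)
    data_files = config.get("data_files", {})
    # Check that every split maps to a list of string patterns.
    for split, patterns in data_files.items():
        if not isinstance(patterns, list) or not all(
            isinstance(pattern, str) for pattern in patterns
        ):
            raise ValueError(f"Split {split!r} must map to a list of strings")
    return data_files


data_files = data_files_from_config(config_text)
print(data_files)
```

The resulting dictionary has the same shape as the data_files argument passed to load_dataset in the first post, so the user would no longer have to type it by hand.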

Yes, that’s the perfect default, @lhoestq. Perhaps it could be expanded later, but for now this solution suffices, IMHO.