Hi,
I can easily create and upload JSON community datasets to the HF Hub (without a loading script). However, when using the dataset I need to specify the data_files parameter, i.e.
dataset = load_dataset("vblagoje/dataset_abc", data_files={"train": "train.json", "validation": "validation.json", "test": "test.json"})
which is a bit of an inconvenience.
Is there a way to avoid the data_files parameter short of providing a loading script as well? If not, why don’t we rely on naming conventions to resolve the data splits anyway?
Cheers,
Vladimir
Tagging @lhoestq @albertvillanova in case you hadn’t seen this!
Hi! I like the idea of using a naming convention. By default it could use something like:
data_files = {
    "train": ["*train*"],
    "test": ["*test*"],
    "validation": ["*dev*", "*valid*"],
}
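To make the idea concrete, here is a rough sketch of how such default patterns could be turned into a data_files mapping. This is not how the datasets library does it; resolve_data_files is a hypothetical helper, and the list of file names is made up for illustration. It only uses fnmatch from the standard library.

```python
from fnmatch import fnmatch

# Default patterns proposed in this thread
DEFAULT_PATTERNS = {
    "train": ["*train*"],
    "test": ["*test*"],
    "validation": ["*dev*", "*valid*"],
}


def resolve_data_files(filenames, patterns=DEFAULT_PATTERNS):
    """Hypothetical helper: map each split name to the files matching its patterns."""
    resolved = {}
    for split, globs in patterns.items():
        matches = sorted(
            name for name in filenames if any(fnmatch(name, g) for g in globs)
        )
        if matches:  # skip splits with no matching file
            resolved[split] = matches
    return resolved


# Made-up repository listing, for illustration only
files = ["train.json", "validation.json", "test.json", "README.md"]
data_files = resolve_data_files(files)
print(data_files)
```

One caveat with patterns this loose: a name like latest_train.json would match both "*train*" and "*test*", so a real implementation would probably need stricter patterns or a priority order.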
Another alternative would be to let the user specify the data_files patterns in an additional JSON file (config.json or something like that). Let me know what you think!
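For the config-file variant, a minimal sketch could look like the following. Everything here is an assumption for illustration: the "data_files" key, the example patterns, and the load_patterns helper are hypothetical, and the config is read from a string rather than from an actual config.json in the repo.

```python
import json

# Hypothetical contents of a config.json at the dataset repo root
config_text = """
{
  "data_files": {
    "train": ["part-0*.json"],
    "validation": ["heldout.json"]
  }
}
"""

# Naming-convention defaults, used when the config does not override them
defaults = {
    "train": ["*train*"],
    "test": ["*test*"],
    "validation": ["*dev*", "*valid*"],
}


def load_patterns(text, fallback):
    """Hypothetical helper: return user-defined patterns, or the fallback if none are given."""
    config = json.loads(text)
    return config.get("data_files", fallback)


patterns = load_patterns(config_text, defaults)
print(patterns)
```

With this design the naming convention stays the default, and the extra file is only needed when the file names don't follow it.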
Yes, that’s the perfect default, @lhoestq. Perhaps it could be expanded later, but for now this solution suffices, IMHO.