Hugging Face Forums

How to use load_dataset to load a json file with all three splits?

anon56874081 October 3, 2022, 10:12am 1

I have a single JSON file that contains all three splits (train/validation/test) and their data.
If I load the file without the field argument, I get this error:

This JSON file contain the following fields: ['train', 'validation', 'test']. Select the correct one and provide it as `field='XXX'` to the dataset loading method.

But I can only use

load_dataset("json", data_files="xx.json", field="train")

to load one specific split.
When I tried to use

load_dataset("json", data_files="xx.json", field=["train", "validation", "test"])

it does not seem to work.
I don't think it should be necessary to split the data into separate files. Is there a better way to do this, or should I open an issue asking for support for a “list” field?

lhoestq October 3, 2022, 4:28pm 2

You can load each split separately:

ds_train = load_dataset("json", data_files="xx.json", field="train")["train"]
ds_test = load_dataset("json", data_files="xx.json", field="test")["train"]
ds_valid = load_dataset("json", data_files="xx.json", field="validation")["train"]

(You need to add ["train"] at the end because splitting by field is not supported right now, so everything ends up in the “train” split.)

I agree it would be nice to be able to provide a split_name <-> field_name mapping in the field argument. Feel free to open an issue!

foggyforest April 13, 2023, 2:04pm 3

You can save the data in three JSON files and use:

load_dataset("json", data_files={'train':'xx/train.json', 'validation':'xx/valid.json', 'test':'xx/test.json'})
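
If the data already lives in one combined file, a small standard-library script can write the three files for you. This is a rough sketch, assuming each top-level key of the combined file holds a list of records; `split_json_file` and the output file names are made up for illustration.

```python
import json
from pathlib import Path


def split_json_file(src, out_dir):
    """Write each top-level field of a combined JSON file to its own file.

    Returns a dict mapping split name -> path of the file written for it.
    Hypothetical helper; assumes the top-level JSON value is an object
    whose keys are the split names.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with open(src, encoding="utf-8") as f:
        combined = json.load(f)

    data_files = {}
    for split_name, records in combined.items():
        target = out_dir / f"{split_name}.json"
        with open(target, "w", encoding="utf-8") as f:
            json.dump(records, f, ensure_ascii=False)
        data_files[split_name] = str(target)
    return data_files
```

The returned dict has the same shape as the data_files mapping above, so it can be passed along as `load_dataset("json", data_files=split_json_file("xx.json", "xx"))`.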
