Loading specific features in a JSON dataset

alejoa · December 3, 2023, 9:45am

Hi. Is there a way to load specific fields in a dataset stored in JSON or JSON lines format?

For example, if a file contains the following lines (extracted from here):

{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}

How can I load only the `id`, `father`, and `mother` features (leaving out the `children` feature)?

Thanks.

lhoestq · December 4, 2023, 10:26am

We use PyArrow to read JSON files into Arrow tables, but according to the documentation it doesn’t seem to be possible to load only a subset of fields: pyarrow.json.read_json — Apache Arrow v14.0.1
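Since the JSON reader cannot skip fields, one possible workaround is to filter the file before loading it. The following is only a minimal sketch using the Python standard library. The helper name `keep_fields` and the file names are hypothetical, and it assumes one JSON object per line with the wanted fields at the top level.

```python
import json


def keep_fields(src, dst, fields):
    """Copy a JSON Lines file, keeping only the given top-level fields.

    `keep_fields` is a hypothetical helper, not part of any library.
    Returns the number of records written.
    """
    written = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Keep only the requested keys that are present in this record.
            slim = {key: record[key] for key in fields if key in record}
            fout.write(json.dumps(slim, ensure_ascii=False) + "\n")
            written += 1
    return written
```

For the example file, `keep_fields("families.jsonl", "families_slim.jsonl", ["id", "father", "mother"])` would write a copy without `children`, which could then be loaded as usual with `load_dataset("json", data_files="families_slim.jsonl")`. The trade-off is an extra pass over the data and a second file on disk.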

It is possible to load a subset of fields if the data is in Parquet, though, since Parquet is a columnar format. You just need to pass `columns=...` (see the ParquetConfig parameters).
