Loading specific features in a JSON dataset

Hi. Is there a way to load specific fields in a dataset stored in JSON or JSON lines format?

For example, if a file contains the following lines (extracted from here):

{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}

How can I load only the id, father, and mother features (leaving out the `children` feature)?

Thanks.

We use PyArrow to read JSON files into Arrow tables, but according to its documentation it doesn't seem to be possible to load only a subset of fields (see pyarrow.json.read_json, Apache Arrow v14.0.1).

However, it is possible to load a subset of fields if the data is in Parquet, since Parquet is a columnar format. You just need to pass columns=... (see the ParquetConfig parameters).
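If converting to Parquet isn't an option, one possible workaround is to filter the fields yourself while reading the JSON Lines data. The following is only a minimal sketch using the Python standard library, not something the JSON loader does for you; the helper name iter_selected_fields and the sample_lines list are made up for illustration.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_selected_fields(lines: Iterable[str], fields: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per non-empty JSON line, keeping only the requested fields.

    Hypothetical helper for illustration; it is not part of any library.
    """
    wanted = list(fields)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the requested keys; keys missing from a record are skipped.
        yield {key: record[key] for key in wanted if key in record}


# The three example records from the question, inlined so the sketch is self-contained.
sample_lines = [
    '{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}',
    '{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}',
    '{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}',
]

rows = list(iter_selected_fields(sample_lines, ["id", "father", "mother"]))
print(rows[0])
```

With a real file you would pass the open file object instead of sample_lines, since iterating over a text file yields its lines. Assuming the datasets library is installed, a generator like this could probably be handed to Dataset.from_generator to build the dataset. Note that every line is still parsed in full, so this keeps the unwanted field out of the resulting dataset but does not save parsing time.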
