Json dump format for load_dataset

vblagoje · September 14, 2021, 10:10am

Hey guys,

How do I properly encode/format a JSON file dump (or use any other approach for creating JSON files) so that the resulting file is easily digested by the JSON variant of load_dataset, as described in the docs?

TIA,
Vladimir

lhoestq · September 14, 2021, 2:37pm

Hi ! You can simply use .to_json() - see documentation here

Here is an example using SQuAD:

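from datasets import load_dataset

# Load the SQuAD training split and write it out as JSON Lines
squad = load_dataset("squad", split="train")
squad.to_json("squad.json")

# Reload the exported file with the generic JSON loader
data_files = {"train": "squad.json"}
re_squad = load_dataset("json", data_files=data_files, split="train")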

This creates a JSON Lines file, then reloads it using the JSON dataset loader.

vblagoje · September 14, 2021, 2:49pm

Yes, of course, @lhoestq , but I wondered about a JSON file I made myself in some processing, not an already prepared JSON file. If I attempt to load the newly created JSON file (created using JSON dump), it needs to be one record per line, cannot have square brackets at the beginning/end, etc. How can I format any JSON holding data records so it could be consumed by datasets easily?

lhoestq · September 14, 2021, 3:09pm

You can format your data as JSON Lines, so as you said:

one record per line (they can be created via json.dumps from the standard lib for example)
no square brackets at the beginning/end

Moreover:

escape line breaks inside string values as “\n” so that each record stays on one single line (json.dumps does this automatically)
nested fields are supported
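
The points above can be sketched with the standard library alone. This is a minimal sketch, not an official datasets helper: the records, the write_jsonl function, and the file name my_data.jsonl are all hypothetical and only illustrate the one-record-per-line layout.

```python
import json

# Hypothetical records; any list of dicts with a consistent schema would do.
records = [
    {"id": 1, "text": "first line\nsecond line", "meta": {"source": "a"}},
    {"id": 2, "text": "plain text", "meta": {"source": "b"}},
]


def write_jsonl(records, path):
    """Write one JSON object per line, with no enclosing square brackets."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes line breaks inside strings as \n, so each
            # record stays on a single physical line.
            f.write(json.dumps(record) + "\n")


write_jsonl(records, "my_data.jsonl")

# Sanity check: every line parses on its own as one complete record.
with open("my_data.jsonl", encoding="utf-8") as f:
    reloaded = [json.loads(line) for line in f]

# The file can then be loaded like squad.json in the example above
# (requires the datasets library, so it is left commented out here):
# from datasets import load_dataset
# ds = load_dataset("json", data_files={"train": "my_data.jsonl"}, split="train")
```

Nested fields such as "meta" are written as ordinary nested JSON objects on the same line.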

vblagoje · September 14, 2021, 3:51pm

Yeah, this works. I manually wrote out the JSON records one line at a time, with “\n” at the end of each line, using a for loop. I thought there might be some sort of HF datasets encoder that could do this for me.

Rasputin312 · September 5, 2024, 6:27pm

Can you show an example of this? I have the exact same issue – trying to create my own dataset using .json but it doesn’t like anything I do. I know how to create a json in python. But the question is what exactly is the format of that json? I know to create Features and feed that to load_dataset.
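
For reference, here is a hedged sketch of the format discussed earlier in the thread, starting from a file that json.dump wrote as a single array in square brackets. The helper json_array_to_jsonl, the file names, and the sample records are hypothetical, and it assumes the file holds a top-level list of records.

```python
import json


def json_array_to_jsonl(src_path, dst_path):
    """Convert a file holding a JSON array of records into JSON Lines."""
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)  # assumed: a top-level list of dicts
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array of records")
    with open(dst_path, "w", encoding="utf-8") as f:
        for record in records:
            # One record per line, no surrounding brackets or commas.
            f.write(json.dumps(record) + "\n")
    return len(records)


# Hypothetical input, written the way json.dump writes a list of records.
sample = [
    {"question": "Q1", "answers": {"text": ["A1"]}},
    {"question": "Q2", "answers": {"text": ["A2"]}},
]
with open("dump.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, indent=2)

count = json_array_to_jsonl("dump.json", "dump.jsonl")
```

After this, dump.jsonl contains exactly one JSON object per line, which is the layout described above for the JSON loader.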
