JSON dump format for load_dataset

Hey guys,

How do I properly encode/format a JSON file dump (or use any other approach for creating JSON files) so that the resulting file is easily digested by the JSON variant of load_dataset, as described in the docs?

TIA,
Vladimir

Hi! You can simply use .to_json() on your dataset; see the documentation.

Here is an example using SQuAD:

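from datasets import load_dataset

# Load the SQuAD training split and write it out as JSON Lines
squad = load_dataset("squad", split="train")
squad.to_json("squad.json")

# Reload the exported file with the generic JSON loader
data_files = {"train": "squad.json"}
re_squad = load_dataset("json", data_files=data_files, split="train")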

This creates a JSON Lines file, then it reloads it using the JSON dataset loader :slight_smile:

Yes, of course, @lhoestq, but I was asking about a JSON file I created myself during some processing, not an already prepared one. If I try to load the newly created JSON file (created using a JSON dump), it needs to have one record per line, cannot have square brackets at the beginning/end, etc. How can I format arbitrary JSON data records so that datasets can consume them easily?

You can format your data as JSON Lines, so as you said:

  • one record per line (they can be created via json.dumps from the standard lib for example)
  • no square brackets at the beginning/end

Moreover:

  • escape line breaks inside string values as “\n”, so that each record stays on one single line (json.dumps does this for you)
  • nested fields are supported
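
To make the points above concrete, here is a minimal sketch that writes records as JSON Lines using only the standard library. The helper name write_jsonl, the sample records, and the file name my_data.jsonl are made up for illustration; they are not part of datasets.

```python
import json

def write_jsonl(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes line breaks inside strings as \n,
            # so every record stays on a single physical line.
            f.write(json.dumps(record) + "\n")

# Hypothetical sample data: a multi-line string and a nested field
records = [
    {"id": 1, "text": "first line\nsecond line", "meta": {"lang": "en"}},
    {"id": 2, "text": "another record", "meta": {"lang": "fr"}},
]
write_jsonl(records, "my_data.jsonl")

# The file can then be loaded the same way as in the SQuAD example:
# load_dataset("json", data_files={"train": "my_data.jsonl"}, split="train")
```

Note that there is no enclosing list and no comma between records; each line is a complete JSON object on its own.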

Yeah, this works. I manually wrote out the JSON records one line at a time, with “\n” at the end of each line, using a for loop. I thought there might be some sort of HF datasets encoder that could do this for me.
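
For a file that was already written as one big JSON array (square brackets at the beginning/end), a small conversion step may be easier than changing the processing code. This is only a sketch using the standard library; the function name json_array_to_jsonl and the file names are hypothetical, and it assumes the whole array fits in memory.

```python
import json

def json_array_to_jsonl(src_path, dst_path):
    """Convert a file holding a top-level JSON array into JSON Lines.

    Returns the number of records written.
    """
    with open(src_path, encoding="utf-8") as src:
        records = json.load(src)  # loads the whole array into memory
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array of records")
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            # One compact JSON object per line, no enclosing brackets
            dst.write(json.dumps(record) + "\n")
    return len(records)

# Demo: create a pretty-printed array file, then convert it
sample = [{"question": "q1", "answer": "a1"}, {"question": "q2", "answer": "a2"}]
with open("dump.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, indent=2)

count = json_array_to_jsonl("dump.json", "dump.jsonl")
```

The resulting dump.jsonl has one record per line and can be passed to load_dataset("json", ...) as in the example earlier in the thread.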