Hey guys,
How do I properly encode/format json file dump (or use any other approach for creating JSON files) so that the created JSON file is easily digested by load_dataset JSON variant as described in the docs?
TIA,
Vladimir
Hi! You can simply use .to_json(); see the Dataset.to_json documentation.
Here is an example using SQuAD:
from datasets import load_dataset
squad = load_dataset("squad", split="train")
squad.to_json("squad.json")
data_files = {"train": "squad.json"}
re_squad = load_dataset("json", data_files=data_files, split="train")
This creates a JSON Lines file, then reloads it using the JSON dataset loader.
Yes, of course, @lhoestq , but I wondered about a JSON file I made myself in some processing, not an already prepared JSON file. If I attempt to load the newly created JSON file (created using JSON dump), it needs to be one record per line, cannot have square brackets at the beginning/end, etc. How can I format any JSON holding data records so it could be consumed by datasets easily?
You can format your data as JSON Lines, so as you said: one record per line, with no square brackets at the beginning or end (you can serialize each record with json.dumps from the standard lib, for example).
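As a minimal sketch of the JSON Lines approach described above, the snippet below writes one JSON object per line using only the standard library. The records, the `write_jsonl` helper, and the file name `my_data.jsonl` are made up for illustration; they are not part of the datasets library.

```python
import json

# Hypothetical example records; in practice these come from your own processing.
# Each record should be a dict with the same keys.
records = [
    {"id": 1, "text": "first example", "label": 0},
    {"id": 2, "text": "second example", "label": 1},
]


def write_jsonl(rows, path):
    """Write one JSON object per line, with no enclosing brackets or commas."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # json.dumps produces a single-line string by default (no indent).
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


write_jsonl(records, "my_data.jsonl")
```

The resulting file should then be loadable the same way as in the SQuAD example, i.e. load_dataset("json", data_files={"train": "my_data.jsonl"}, split="train").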
Yeah, this works. I manually wrote out the JSON records one line at a time, with “\n” at the end of each line, using a for loop. I thought there might be some sort of HF datasets encoder that could do this for me.
Can you show an example of this? I have the exact same issue: I'm trying to create my own dataset from a .json file, but it doesn’t like anything I do. I know how to create a JSON file in Python, but the question is what exactly the format of that JSON should be. I know how to create Features and feed that to load_dataset.
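To make the expected format concrete, here is a rough sketch that takes a file written with json.dump (a single array wrapped in square brackets) and rewrites it as JSON Lines. The helper name `json_array_to_jsonl`, the file names, and the sample rows are assumptions for illustration only.

```python
import json


def json_array_to_jsonl(src_path, dst_path):
    """Convert a JSON file holding one top-level array into JSON Lines."""
    with open(src_path, encoding="utf-8") as src:
        rows = json.load(src)
    if not isinstance(rows, list):
        raise ValueError("expected a top-level JSON array of records")
    with open(dst_path, "w", encoding="utf-8") as dst:
        for row in rows:
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)


# Hypothetical sample data, dumped the "usual" way as one bracketed array.
rows = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What colour is the sky?", "answer": "blue"},
]
with open("dumped.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)

count = json_array_to_jsonl("dumped.json", "converted.jsonl")

# converted.jsonl now contains exactly these two lines:
# {"question": "What is 2 + 2?", "answer": "4"}
# {"question": "What colour is the sky?", "answer": "blue"}
```

Each line is a complete JSON object on its own, which is the one-record-per-line layout discussed earlier in the thread.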