load_dataset Fails for Custom JSON Format

I have a JSON file that has the following format (used :
[ { “A” : string, “B”: list of string, “C”: list of list of bool }, sample2, sample3, …]

When I used load_dataset("json", data_files={"train": data_path + "Data/train.json"}), I got the following error:

datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
pyarrow.lib.ArrowTypeError: Expected bytes, got a ‘list’ object
pyarrow.lib.ArrowInvalid: JSON parse error: Column() changed from object to array in row 0

What’s wrong with my procedure? The only thing I can imagine is that load_dataset() doesn’t support lists of lists. But I didn’t find this in the documentation, and the error message is not explicit enough to tell whether that’s the reason.

Thanks in advance for any input :)

Hi! What does "used : " mean? Can you please specify the structure inside a code block?

JSON arrays/lists are supported, but this is (still) not documented.

Thanks for your response!
The “used” was a typo; the structure is

[ { “A” : string, “B”: list of string, “C”: list of list of bool }, sample2, sample3, …]

So essentially the JSON is a list of dictionaries. Each dictionary has three key-value pairs with string keys; the values are a string, a list of strings, and a nested list of bools.
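
One thing that may be worth ruling out is a record that does not actually follow this structure, since an "Expected bytes, got a 'list' object" error can be consistent with a column whose type differs between rows. Whether that applies to this file is an assumption. A minimal sketch of such a check, where `describe`, `inconsistent_keys` and the sample records are hypothetical and made up for illustration:

```python
from collections import defaultdict


def describe(value):
    """Return a short type label such as 'str', 'list[str]' or 'list[list[bool]]'."""
    if isinstance(value, list):
        # An empty list gets its own label, so it is reported as differing
        # from a populated list under the same key.
        inner = sorted({describe(v) for v in value}) or ["empty"]
        return "list[" + "|".join(inner) + "]"
    return type(value).__name__


def inconsistent_keys(records):
    """Map each key whose type label differs between records to its labels."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(describe(value))
    return {key: labels for key, labels in seen.items() if len(labels) > 1}


# Hypothetical records: the second one stores "B" as a plain string
# instead of a list of strings.
records = [
    {"A": "x", "B": ["p", "q"], "C": [[True, False], [True]]},
    {"A": "y", "B": "p", "C": [[False]]},
]
print(inconsistent_keys(records))
```

For a real file, `records` would come from `json.load` on the training file; an empty result would suggest that the types are uniform and the cause lies elsewhere.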

Hi, this is rather late but I also ran into this issue and realized that the JSON format should be

{ “A” : string, “B”: list of string, “C”: list of list of bool }
{ “A” : string, “B”: list of string, “C”: list of list of bool }
...
{ “A” : string, “B”: list of string, “C”: list of list of bool }

Basically, each dictionary’s JSON string goes on its own line, separated by a newline character (the JSON Lines layout), instead of all the dictionaries being dumped to the file as one list.
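
If you already have a file that holds one big list, a conversion to the one-object-per-line layout can be sketched with the standard library alone. The function name `json_array_to_jsonl` and the file names in the note below are hypothetical, and this is only one way to do it:

```python
import json


def json_array_to_jsonl(src_path, dst_path):
    """Rewrite a file holding one JSON array of objects as one object per line."""
    with open(src_path, encoding="utf-8") as src:
        records = json.load(src)  # the whole array is read into memory
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            # json.dumps emits no raw newlines, so each record stays on one line.
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

After something like json_array_to_jsonl("train.json", "train.jsonl"), the new file can be passed to load_dataset("json", data_files={"train": "train.jsonl"}). Since the whole list is loaded at once, this sketch assumes the file fits in memory.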
