I have a json file that has the following format (used :
[ { “A” : string, “B”: list of string, “C”: list of list of bool }, sample2, sample3, …]
When I called load_dataset("json", data_files={"train": data_path + "Data/train.json"}), I got the following errors:
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
pyarrow.lib.ArrowTypeError: Expected bytes, got a ‘list’ object
pyarrow.lib.ArrowInvalid: JSON parse error: Column() changed from object to array in row 0
What’s wrong with my procedure? The only explanation I can think of is that load_dataset() doesn’t support lists of lists. But I couldn’t find this in the documentation, and the error messages are not explicit enough to tell whether that’s the reason.
Thanks for your response!
The “used” was a typo, the structure is
[ { “A” : string, “B”: list of string, “C”: list of list of bool }, sample2, sample3, …]
So essentially the JSON is a list of dictionaries. Each dictionary has string keys and three key-value pairs: a string, a list of strings, and a nested list of bools.
Hi, this is rather late but I also ran into this issue and realized that the JSON format should be
{ “A” : string, “B”: list of string, “C”: list of list of bool }
{ “A” : string, “B”: list of string, “C”: list of list of bool }
...
{ “A” : string, “B”: list of string, “C”: list of list of bool }
Basically, each dictionary is written as its own JSON string on a separate line, delimited by a newline character (the JSON Lines format), instead of all the dictionaries being dumped to the file as one list.
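In case it helps anyone else, here is a minimal sketch of how the existing file could be converted to that one-object-per-line layout using only the standard library. The function name json_array_to_jsonl and the file names in the usage note are my own placeholders, not anything from the original post, and it assumes the whole file fits in memory.

```python
import json
from pathlib import Path


def json_array_to_jsonl(src_path, dst_path):
    """Rewrite a JSON file holding one top-level list of dicts as JSON Lines.

    Hypothetical helper for illustration; returns the number of records written.
    """
    records = json.loads(Path(src_path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON list of records")

    with open(dst_path, "w", encoding="utf-8") as out:
        for record in records:
            # json.dumps without indent escapes newlines inside strings,
            # so every record ends up on exactly one line.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

After running something like json_array_to_jsonl("train.json", "train.jsonl") (placeholder paths), the new file can be passed to load_dataset("json", data_files={"train": "train.jsonl"}) in the same way as before.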