"Exceeded maximum rows" when using load_dataset() on a JSON file

I have a file that is 150 MB in size but only has 100k rows. When I use load_dataset() to load the data, I get the following error:

Downloading and preparing dataset json/chjun--signal_5 to /root/.cache/huggingface/datasets/chjun___json/chjun--signal_5-943deb579ded8031/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100%
1/1 [00:00<00:00, 54.64it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 43.17it/s]
ERROR:datasets.packaged_modules.json.json:Failed to read file '/root/.cache/huggingface/datasets/downloads/d14d13fd8ba2262eff9553aeb2de18f0c0d8f661c6d500d0afd795dea9606792' with error <class 'pyarrow.lib.ArrowInvalid'>: Exceeded maximum rows

JSONDecodeError Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
132 with open(file, encoding="utf-8") as f:
--> 133 dataset = json.load(f)
134 except json.JSONDecodeError:

14 frames
JSONDecodeError: Extra data: line 1 column 1478 (char 1477)

During handling of the above exception, another exception occurred:

ArrowInvalid Traceback (most recent call last)
ArrowInvalid: Exceeded maximum rows

The above exception was the direct cause of the following exception:

DatasetGenerationError Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
1891 if isinstance(e, SchemaInferenceError) and e.context is not None:
1892 e = e.context
--> 1893 raise DatasetGenerationError("An error occurred while generating the dataset") from e
1894
1895 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

Is this a bug? I'm trying to decrease the size of my training data.

Hi! It seems like your JSON file is not formatted correctly: it should contain either JSON Lines or a single JSON array of objects.
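
For example, either of these layouts parses correctly (a minimal sketch with made-up records, not your actual data):

    import json

    records = [
        {"data_in": [1, 2], "data_out": [3, 4]},
        {"data_in": [5, 6], "data_out": [7, 8]},
    ]  # made-up records just for illustration

    # Option 1: JSON Lines (one object per line)
    with open("data.jsonl", "w") as f:
        for item in records:
            f.write(json.dumps(item) + "\n")

    # Option 2: a single JSON array of objects
    with open("data.json", "w") as f:
        json.dump(records, f)

Either file can then be passed to load_dataset("json", data_files=...).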

But when I reduced the rows of the JSON file to 91,000, it worked fine. I used a for loop to write the data to my JSON file, so there should be no difference between the rows.

By the way, this is my code for writing out the JSON file before loading it into HF.

    import json

    list_json = []
    for i in range(100000):
        merged_json = {"data_in": padded_flip_psf[1], "data_out": padded_flip_psf[1]}
        list_json.append(merged_json)

    print(len(list_json))
    print(list_json[0])
    print(len(list_json[0]["data_in"]))
    print(len(list_json[0]["data_out"]))

    # each json.dump call appends the next object directly after the
    # previous one, with no newline or separator in between
    out_file = open('test.json', 'w')
    for item in list_json:
        json.dump(item, out_file)

    out_file.close()

Update: if I write the JSON file using

    json.dump(list_json, out_file)

directly, instead of

    for item in list_json:
        json.dump(item, out_file)

it works with load_dataset in HF.
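
That makes sense in hindsight: dumping each item in a loop writes the objects back to back on one line with nothing between them, so the file is neither JSON Lines nor a single JSON value, which is what the "Extra data" error above was complaining about. For reference, a minimal sketch of the working version (with a stand-in list instead of the real padded_flip_psf data, loading the file locally through the json builder):

    import json
    from datasets import load_dataset

    # stand-in for the real records built from padded_flip_psf
    list_json = [{"data_in": [0.0, 1.0], "data_out": [0.0, 1.0]} for _ in range(1000)]

    # write a single JSON array of objects
    with open("test.json", "w") as out_file:
        json.dump(list_json, out_file)

    dataset = load_dataset("json", data_files="test.json")
    print(dataset["train"][0])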