"Exceeded maximum rows" when using load_dataset() on a JSON file

I have a file that is 150 MB in size but only has 100k rows. When I use load_dataset() to load the data, I get the following error:

Downloading and preparing dataset json/chjun--signal_5 to /root/.cache/huggingface/datasets/chjun___json/chjun--signal_5-943deb579ded8031/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100%
1/1 [00:00<00:00, 54.64it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 43.17it/s]
ERROR:datasets.packaged_modules.json.json:Failed to read file '/root/.cache/huggingface/datasets/downloads/d14d13fd8ba2262eff9553aeb2de18f0c0d8f661c6d500d0afd795dea9606792' with error <class 'pyarrow.lib.ArrowInvalid'>: Exceeded maximum rows

JSONDecodeError Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
132 with open(file, encoding="utf-8") as f:
--> 133 dataset = json.load(f)
134 except json.JSONDecodeError:

14 frames
JSONDecodeError: Extra data: line 1 column 1478 (char 1477)

During handling of the above exception, another exception occurred:

ArrowInvalid Traceback (most recent call last)
ArrowInvalid: Exceeded maximum rows

The above exception was the direct cause of the following exception:

DatasetGenerationError Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
1891 if isinstance(e, SchemaInferenceError) and e.context is not None:
1892 e = e.context
--> 1893 raise DatasetGenerationError("An error occurred while generating the dataset") from e
1894
1895 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

Is this a bug? I'm trying to decrease the size of my training data.

Hi! It seems like your JSON file is not formatted correctly: it should contain either JSON Lines or a single JSON array of objects.
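
For example, either of these layouts parses correctly (a minimal sketch with made-up records, not your actual data):

    import json

    records = [
        {"data_in": [1, 2], "data_out": [3, 4]},
        {"data_in": [5, 6], "data_out": [7, 8]},
    ]  # made-up records just for illustration

    # Option 1: JSON Lines (one object per line)
    with open("data.jsonl", "w") as f:
        for item in records:
            f.write(json.dumps(item) + "\n")

    # Option 2: a single JSON array of objects
    with open("data.json", "w") as f:
        json.dump(records, f)

Either file can then be passed to load_dataset("json", data_files=...).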

But when I reduced the rows of the JSON file to 91,000, it worked fine. I used a for loop to write the data to my JSON file, so there should be no difference between the rows.

By the way, this is my code for writing out the JSON file before loading it into HF.

    import json

    list_json = []
    for i in range(100000):
        merged_json = {"data_in": padded_flip_psf[1], "data_out": padded_flip_psf[1]}
        list_json.append(merged_json)

    print(len(list_json))
    print(list_json[0])
    print(len(list_json[0]["data_in"]))
    print(len(list_json[0]["data_out"]))

    # each json.dump call appends the next object directly after the
    # previous one, with no newline or separator in between
    out_file = open('test.json', 'w')
    for item in list_json:
        json.dump(item, out_file)

    out_file.close()

Update: if I write the JSON file using

    json.dump(list_json, out_file)

directly, instead of

    for item in list_json:
        json.dump(item, out_file)

it works with load_dataset in HF.
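
That makes sense in hindsight: dumping each item in a loop writes the objects back to back on one line with nothing between them, so the file is neither JSON Lines nor a single JSON value, which is what the "Extra data" error above was complaining about. For reference, a minimal sketch of the working version (with a stand-in list instead of the real padded_flip_psf data, loading the file locally through the json builder):

    import json
    from datasets import load_dataset

    # stand-in for the real records built from padded_flip_psf
    list_json = [{"data_in": [0.0, 1.0], "data_out": [0.0, 1.0]} for _ in range(1000)]

    # write a single JSON array of objects
    with open("test.json", "w") as out_file:
        json.dump(list_json, out_file)

    dataset = load_dataset("json", data_files="test.json")
    print(dataset["train"][0])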