I have a file that is about 150 MB in size but contains only ~100k rows. When I use load_dataset() to load the data (called roughly as sketched below), I get the following error:
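For reference, this is roughly how I call it (nothing special, just the repo name):

```python
from datasets import load_dataset

# Load my dataset repo from the Hub; it contains the ~150 MB JSON file.
dataset = load_dataset("chjun/signal_5")
```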
Downloading and preparing dataset json/chjun--signal_5 to /root/.cache/huggingface/datasets/chjun___json/chjun--signal_5-943deb579ded8031/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100% 1/1 [00:00<00:00, 54.64it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 43.17it/s]
ERROR:datasets.packaged_modules.json.json:Failed to read file '/root/.cache/huggingface/datasets/downloads/d14d13fd8ba2262eff9553aeb2de18f0c0d8f661c6d500d0afd795dea9606792' with error <class 'pyarrow.lib.ArrowInvalid'>: Exceeded maximum rows

JSONDecodeError                           Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
    132                         with open(file, encoding="utf-8") as f:
--> 133                             dataset = json.load(f)
    134                     except json.JSONDecodeError:

(14 frames elided)

JSONDecodeError: Extra data: line 1 column 1478 (char 1477)

During handling of the above exception, another exception occurred:
ArrowInvalid Traceback (most recent call last)
ArrowInvalid: Exceeded maximum rows

The above exception was the direct cause of the following exception:
DatasetGenerationError Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
1891 if isinstance(e, SchemaInferenceError) and e.context is not None:
1892 e = e.context
--> 1893             raise DatasetGenerationError("An error occurred while generating the dataset") from e
1894
   1895         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset
Is this a bug, or is something wrong with how my file is formatted? I'm trying to decrease the size of my training data.
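For what it's worth, my guess (an assumption on my part, not something I've confirmed) is that the json builder expects newline-delimited JSON, and my file might be one big JSON document on a single line, which would fit the "Extra data: line 1 column 1478" message. A minimal workaround sketch I'm considering, assuming the file really is a single JSON array; the paths are placeholders:

```python
import pandas as pd
from datasets import load_dataset

# Re-save the file as newline-delimited JSON (one record per line),
# the layout that pyarrow's JSON reader handles natively.
df = pd.read_json("signal_5.json")  # placeholder path for the original file
df.to_json("signal_5.jsonl", orient="records", lines=True)

# Load the JSON Lines file directly with the json builder.
dataset = load_dataset("json", data_files="signal_5.jsonl")
```

If the file is already valid JSON Lines, then this wouldn't apply and the error might indeed be a bug.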