Strange pyarrow error when extracting rows from a public dataset

OlivierSchmittSonar · April 29, 2025, 4:17pm

Hi,

I’m trying to download and process rows from

I’m using the generic load_dataset function (using datasets = "==3.5.1"):
rows = load_dataset(path="OpenCoder-LLM/opc-annealing-corpus", name="algorithmic_corpus", split="train")

I can download and generate the split:

Downloading data: 100%|██████████| 34/34 [00:00<00:00, 86.59files/s]
Generating train split: 10579390 examples [00:07, 1452745.44 examples/s]

But just after building the split it seems load_dataset tries to load this file as an arrow file:

https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus/blob/main/algorithmic_corpus/state.json

And this leads to an error:

ERROR: Failed to read file '/Users/<HOME>/.cache/huggingface/hub/datasets--OpenCoder-LLM--opc-annealing-corpus/snapshots/cb08b1b40bb19f88e3c4f48d6b4647bed588fc04/algorithmic_corpus/state.json' with error <class 'pyarrow.lib.ArrowInvalid'>: Not an Arrow file

pyarrow.lib.ArrowInvalid: Expected to read 538970747 metadata bytes, but only read 2131

Which makes sense because state.json is not an Arrow file; it describes the dataset.
Since load_dataset is able to load the proper Arrow files, I’d expect it not to treat this JSON file as an Arrow file.
Looks like a bug.
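
The odd byte count in the second error message is consistent with this reading. A minimal sketch, assuming state.json is pretty-printed JSON that begins with "{", a newline and two spaces, and assuming the reader decodes the first four bytes of the file as a little-endian 32-bit metadata length (as the Arrow IPC message framing does):

```python
import struct

# Assumed first four bytes of state.json: "{", newline, two spaces.
# This prefix is a guess based on typical pretty-printed JSON, not read from the file.
json_prefix = b"{\n  "

# Decode the four bytes as a little-endian signed 32-bit integer,
# the way a length prefix would be read.
(metadata_length,) = struct.unpack("<i", json_prefix)

print(metadata_length)  # 538970747, the value reported in the ArrowInvalid message
```

Under those assumptions the reader is not looking at a corrupted Arrow file at all: it is interpreting the opening characters of a JSON document as a length and then running out of bytes.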

Any idea how to fix this?
Thanks a lot.

John6666 · April 30, 2025, 1:42am

Hmm… It may not be possible to load datasets stored in Arrow format with the load_dataset function.

hyxmmm
It seems that the arrow dataset is not supported to be loaded with load_dataset(). We re-uploaded the dataset in parquet format. Please download it again. Sorry for the trouble.
OpenCoder-LLM/opc-annealing-corpus · Met ArrowInvalid error
`load_dataset` does not work with uploaded arrow file · Issue #3035 · huggingface/datasets · GitHub

OlivierSchmittSonar · April 30, 2025, 6:50am

Thanks!
So now I have two options:

- use the parquet branch: OpenCoder-LLM/opc-annealing-corpus at refs/convert/parquet
- try load_dataset with the data_files option (requires playing with URLs and likely wildcards)
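
Both options can be sketched roughly as follows. This is untested against the actual repository: the helper names are hypothetical, the parquet layout `<config>/<split>/*.parquet` is the usual layout of the auto-converted branch but is assumed here, and the `*.arrow` pattern assumes the data files in `algorithmic_corpus/` carry that extension.

```python
from typing import Any

REPO_ID = "OpenCoder-LLM/opc-annealing-corpus"
CONFIG = "algorithmic_corpus"


def parquet_branch_kwargs(repo_id: str, config: str, split: str = "train") -> dict[str, Any]:
    """Option 1: read the parquet files from the refs/convert/parquet branch.

    Assumes the branch stores files as <config>/<split>/*.parquet.
    """
    return {
        "path": repo_id,
        "revision": "refs/convert/parquet",
        "data_files": {split: f"{config}/{split}/*.parquet"},
        "split": split,
    }


def arrow_only_kwargs(repo_id: str, config: str, split: str = "train") -> dict[str, Any]:
    """Option 2: restrict data_files to the arrow files so state.json is never opened.

    Assumes the data files match <config>/*.arrow on the main branch.
    """
    return {
        "path": repo_id,
        "data_files": {split: f"{config}/*.arrow"},
        "split": split,
    }


def load_rows(kwargs: dict[str, Any]):
    """Call datasets.load_dataset with the prepared arguments."""
    # Imported lazily so the helpers above can be inspected without the library installed.
    from datasets import load_dataset

    return load_dataset(**kwargs)
```

For example, `load_rows(arrow_only_kwargs(REPO_ID, CONFIG))` would try the second option; if the file pattern does not match the repository layout, the wildcard is the part to adjust.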

It’s a bit frustrating that the repo contains all the metadata, yet either a specific approach is required or the dataset needs to be duplicated in another format.

A unified interface loading all sorts of dataset formats would be great; it seems almost implemented because the load_dataset function loads all the arrow files by itself.

Might look into the code to see if I can come up with a change.
Thanks again!
Best.
