Strange pyarrow error when extracting rows from a public dataset

Hi,

I’m trying to download and process rows from the OpenCoder-LLM/opc-annealing-corpus dataset.

I’m using the generic load_dataset function (with datasets = "==3.5.1"):
rows = load_dataset(path="OpenCoder-LLM/opc-annealing-corpus", name="algorithmic_corpus", split="train")

I can download and generate the split:

Downloading data: 100%|██████████| 34/34 [00:00<00:00, 86.59files/s]
Generating train split: 10579390 examples [00:07, 1452745.44 examples/s]

But just after building the split it seems load_dataset tries to load this file as an arrow file:

https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus/blob/main/algorithmic_corpus/state.json

And this leads to an error:

ERROR: Failed to read file '/Users/<HOME>/.cache/huggingface/hub/datasets--OpenCoder-LLM--opc-annealing-corpus/snapshots/cb08b1b40bb19f88e3c4f48d6b4647bed588fc04/algorithmic_corpus/state.json' with error <class 'pyarrow.lib.ArrowInvalid'>: Not an Arrow file

pyarrow.lib.ArrowInvalid: Expected to read 538970747 metadata bytes, but only read 2131

Which makes sense, because state.json is not an Arrow file; it only describes the dataset.
Since load_dataset is able to load the proper Arrow files, I’d expect it not to treat this JSON file as an Arrow file.
This looks like a bug.
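A side note on the number in the second message: it is consistent with the reader interpreting the first four bytes of the JSON text as a little-endian length prefix. The sketch below assumes state.json starts with `{`, a newline, and two spaces of indentation (typical for indented JSON, but not verified against the actual file); the helper name is hypothetical.

```python
def decode_length_prefix(first_bytes: bytes) -> int:
    """Interpret the first 4 bytes as a little-endian 32-bit length."""
    return int.from_bytes(first_bytes[:4], byteorder="little")


# Assumed start of an indented JSON file: '{', newline, two spaces.
json_start = b'{\n  '

# The value matches the "metadata bytes" count in the error message.
print(decode_length_prefix(json_start))

# Going the other way: the reported length maps back to JSON text.
print((538970747).to_bytes(4, byteorder="little"))
```

If that assumption holds, the error does not mean an Arrow file was corrupted; it means plain JSON text was handed to the Arrow reader.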

Any idea how to fix this?
Thanks a lot.


Hmm… it seems that datasets stored in PyArrow (Arrow) format cannot be loaded with the load_dataset function.

hyxmmm
It seems that the arrow dataset is not supported to be loaded with load_dataset(). We re-uploaded the dataset in parquet format. Please download it again. Sorry for the trouble.
OpenCoder-LLM/opc-annealing-corpus · Met ArrowInvalid error
`load_dataset` does not work with uploaded arrow file · Issue #3035 · huggingface/datasets · GitHub
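For anyone who still needs the original Arrow layout: a state.json file usually means the folder was written with `Dataset.save_to_disk`, and its counterpart is `load_from_disk`. Below is a minimal, untested sketch. It assumes the `algorithmic_corpus` folder at the revision shown in the error path is a complete `save_to_disk` output and that the revision is still available on the Hub. The function name `load_arrow_snapshot` is hypothetical.

```python
import os

REPO_ID = "OpenCoder-LLM/opc-annealing-corpus"
SUBDIR = "algorithmic_corpus"
# Revision hash copied from the snapshot path in the error message.
REVISION = "cb08b1b40bb19f88e3c4f48d6b4647bed588fc04"


def load_arrow_snapshot(repo_id=REPO_ID, subdir=SUBDIR, revision=REVISION):
    """Download one folder of a dataset repo and open it with load_from_disk."""
    # Imported lazily so this file can be imported without the libraries installed.
    from datasets import load_from_disk
    from huggingface_hub import snapshot_download

    # Fetch only the files under the chosen sub-folder.
    local_dir = snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        revision=revision,
        allow_patterns=[f"{subdir}/*"],
    )
    # Assumption: the folder holds state.json, dataset_info.json and Arrow shards.
    return load_from_disk(os.path.join(local_dir, subdir))


if __name__ == "__main__":
    rows = load_arrow_snapshot()
    print(rows)
```

With the re-uploaded parquet files, the original load_dataset call on the current revision should not need this workaround.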

Thanks!
So now I have two options:

It’s a bit frustrating that the repo contains all the metadata, yet either a specific loading approach is required or the dataset needs to be duplicated in another format.

A unified interface loading all sorts of dataset formats would be great; it seems almost implemented because the load_dataset function loads all the arrow files by itself.

I might look into the code to see if I can come up with a change.
Thanks again!
Best.
