Hi,
I’m trying to download and process rows from the dataset shown below.
I’m using the generic load_dataset function (with datasets pinned to "==3.5.1"):
from datasets import load_dataset
rows = load_dataset(path="OpenCoder-LLM/opc-annealing-corpus", name="algorithmic_corpus", split="train")
I can download and generate the split:
Downloading data: 100%|██████████| 34/34 [00:00<00:00, 86.59files/s]
Generating train split: 10579390 examples [00:07, 1452745.44 examples/s]
But right after generating the split, load_dataset seems to try to read this file as an Arrow file:
https://huggingface.co/datasets/OpenCoder-LLM/opc-annealing-corpus/blob/main/algorithmic_corpus/state.json
And this leads to an error:
ERROR: Failed to read file '/Users/<HOME>/.cache/huggingface/hub/datasets--OpenCoder-LLM--opc-annealing-corpus/snapshots/cb08b1b40bb19f88e3c4f48d6b4647bed588fc04/algorithmic_corpus/state.json' with error <class 'pyarrow.lib.ArrowInvalid'>: Not an Arrow file
pyarrow.lib.ArrowInvalid: Expected to read 538970747 metadata bytes, but only read 2131
The error itself makes sense, because state.json is not an Arrow file; it only describes the dataset.
Since load_dataset is able to load the actual Arrow files, I’d expect it not to treat this JSON file as an Arrow file.
Looks like a bug.
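One workaround I'm considering, as a minimal and unverified sketch: pass data_files to load_dataset so that only the Arrow shards are picked up and state.json is skipped. This assumes the shards in the algorithmic_corpus folder end in ".arrow", which I have not confirmed. The names keep_arrow_files and load_arrow_only are hypothetical helpers for illustration, not part of the datasets library.

```python
from fnmatch import fnmatch

REPO_ID = "OpenCoder-LLM/opc-annealing-corpus"
SUBDIR = "algorithmic_corpus"
# Assumption: the data shards in this folder use the ".arrow" extension.
ARROW_PATTERN = f"{SUBDIR}/*.arrow"


def keep_arrow_files(paths, pattern=ARROW_PATTERN):
    """Hypothetical helper: keep only the paths that match the glob pattern."""
    return [p for p in paths if fnmatch(p, pattern)]


def load_arrow_only():
    """Hypothetical helper: load the train split from the Arrow shards only."""
    # Imported lazily so keep_arrow_files can be used without `datasets` installed.
    from datasets import load_dataset

    # data_files restricts loading to files matching the pattern,
    # so state.json should no longer be handed to the Arrow reader.
    return load_dataset(REPO_ID, data_files=ARROW_PATTERN, split="train")
```

keep_arrow_files is only there to check which file names the pattern would select before calling load_arrow_only(), which does the actual download.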
Any idea how to fix this?
Thanks a lot.