Dataset generation error after downloading all the parquet files

I tried both datasets, the-stack-dedup and the-stack.
After spending the past 5 days downloading and extracting, I get this error:

Extracting data files: 100%|███████████████| 1/1 [03:32<00:00, 212.39s/it]
Traceback (most recent call last):
File "/media/env/lib/python3.11/site-packages/datasets/", line 1858, in _prepare_split_single
for _, table in generator:
File "/media/env/lib/python3.11/site-packages/datasets/packaged_modules/parquet/", line 67, in _generate_tables
parquet_file = pq.ParquetFile(f)
File "/media/env/lib/python3.11/site-packages/pyarrow/parquet/", line 334, in init
File "pyarrow/_parquet.pyx", line 1220, in
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "", line 1, in
File "/media/env/lib/python3.11/site-packages/datasets/", line 1797, in load_dataset
File "/media/env/lib/python3.11/site-packages/datasets/", line 890, in download_and_prepare
File "/media/env/lib/python3.11/site-packages/datasets/", line 985, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/media/env/lib/python3.11/site-packages/datasets/", line 1746, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/media/env/lib/python3.11/site-packages/datasets/", line 1891, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

This is the error:

pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

So the files you are passing to load_dataset are not Parquet files.
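If you want to find out which file is the bad one before re-downloading anything, here is a rough sketch. An unencrypted Parquet file starts and ends with the 4-byte magic PAR1, which is what pyarrow is complaining about in the footer. The helper names and the "*.parquet" pattern are my own assumptions, not part of datasets or pyarrow, and you would need to point it at wherever your files actually live.

```python
from pathlib import Path

# Unencrypted Parquet files begin and end with these four bytes.
PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file both starts and ends with the Parquet magic bytes."""
    path = Path(path)
    # Too small to hold a header and a footer magic: cannot be valid.
    if path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with path.open("rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), 2)  # 2 = seek relative to end of file
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def find_suspect_files(root, pattern="*.parquet"):
    """List files under root (hypothetical location) that fail the magic check."""
    return sorted(p for p in Path(root).rglob(pattern) if not has_parquet_magic(p))
```

Passing this check does not prove a file is intact, it only rules out the specific failure in the traceback (a truncated download or an HTML error page saved under a .parquet name).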

I'm getting the same error when downloading the OpenOrca dataset from Hugging Face. These aren't my Parquet files; they are hosted on Hugging Face.

Was this ever resolved?

I'm not able to reproduce the error. Most likely, one of the downloaded Parquet files is corrupted in your case, so pass download_mode="force_redownload" to load_dataset to re-download them.
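For reference, the call would look roughly like the sketch below. The wrapper name is hypothetical and only there to keep the example self-contained; download_mode="force_redownload" is the real load_dataset argument. Keep in mind that this discards the cached copy, so for a dataset that took days to download it re-fetches everything.

```python
def reload_from_scratch(dataset_name, **kwargs):
    """Load a dataset while ignoring (and replacing) any cached download.

    Hypothetical convenience wrapper; datasets is imported lazily so that
    defining this function does not require the library or network access.
    """
    from datasets import load_dataset

    return load_dataset(
        dataset_name,
        download_mode="force_redownload",  # re-download instead of reusing the cache
        **kwargs,
    )


# Example usage (the repo id is an assumption; substitute the dataset you are loading):
# ds = reload_from_scratch("Open-Orca/OpenOrca")
```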

Thanks for the response. The problem was the datasets version: I was using v2.13, and when I upgraded to the latest release the problem went away. See: Open-Orca/Mistral-7B-OpenOrca · OpenOrca Dataset 'Fail to generate dataset'
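In case it helps anyone else check their setup, here is a small sketch that reports the installed datasets version and compares it against 2.13. The threshold is an assumption based only on what I saw (2.13 failed, the latest release worked); I don't know the exact release that fixed it. The helper names are made up for this example. Upgrading itself is just pip install -U datasets.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def release_tuple(version_string):
    """Turn '2.13.1' into (2, 13, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_newer_than(version_string, known_bad="2.13"):
    """Compare major.minor only; known_bad is an assumed threshold, not a documented one."""
    return release_tuple(version_string)[:2] > release_tuple(known_bad)[:2]


if __name__ == "__main__":
    found = installed_version("datasets")
    if found is None:
        print("datasets is not installed")
    elif is_newer_than(found):
        print(f"datasets {found} is newer than 2.13")
    else:
        print(f"datasets {found} is 2.13 or older; consider upgrading")
```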

Thanks! This also solved my problem.