I tried both datasets, the-stack-dedup and the-stack.
After spending the past 5 days downloading and extracting, I get this error:
Extracting data files: 100%|███████████████| 1/1 [03:32<00:00, 212.39s/it]
Traceback (most recent call last):
File "/media/env/lib/python3.11/site-packages/datasets/builder.py", line 1858, in _prepare_split_single
for _, table in generator:
File "/media/env/lib/python3.11/site-packages/datasets/packaged_modules/parquet/parquet.py", line 67, in _generate_tables
parquet_file = pq.ParquetFile(f)
^^^^^^^^^^^^^^^^^
File "/media/env/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 334, in __init__
self.reader.open(
File "pyarrow/_parquet.pyx", line 1220, in pyarrow._parquet.ParquetReader.open
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "", line 1, in
File "/media/env/lib/python3.11/site-packages/datasets/load.py", line 1797, in load_dataset
builder_instance.download_and_prepare(
File "/media/env/lib/python3.11/site-packages/datasets/builder.py", line 890, in download_and_prepare
self._download_and_prepare(
File "/media/env/lib/python3.11/site-packages/datasets/builder.py", line 985, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/media/env/lib/python3.11/site-packages/datasets/builder.py", line 1746, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/media/env/lib/python3.11/site-packages/datasets/builder.py", line 1891, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
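
The ArrowInvalid message suggests at least one downloaded file is truncated or is not Parquet at all. A Parquet file begins and ends with the 4-byte magic `PAR1`, so one way to narrow this down without re-downloading everything might be to scan the cache for files that fail that check. The following is only a minimal sketch using the standard library: `has_parquet_magic` and `find_suspect_files` are hypothetical helper names, the cache directory is whatever your setup uses, and the `suffix` filter is an assumption, since depending on the `datasets` version cached files may not carry a `.parquet` extension. Passing the magic check does not prove a file is fully valid.

```python
import os

PARQUET_MAGIC = b"PAR1"  # a Parquet file starts and ends with these 4 bytes


def has_parquet_magic(path):
    """Return True if the file both starts and ends with the Parquet magic."""
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False  # too small to hold a header and a footer magic
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def find_suspect_files(root, suffix=".parquet"):
    """Walk root and list files ending in suffix that fail the magic check.

    The suffix default is an assumption; pass suffix="" to check every file.
    """
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                if not has_parquet_magic(path):
                    suspects.append(path)
    return sorted(suspects)
```

Calling `find_suspect_files` on the download directory would return the paths that look truncated or are not Parquet, which could then be deleted and fetched again individually.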