Used Git to download a dataset, but it fails to load

Hi guys,

I tried to download a dataset with git lfs clone (because of network fluctuations) and then used load_dataset(path="path_to_git_repo") to load it.

But I recently ran into a problem (see the error below). It seems that load_dataset(path="remote_name") works fine, while it can no longer load a dataset from a local path to a git repo. I tested two datasets: 1. microsoft/orca-math-word-problems-200k 2. emozilla/pg_books-tokenized-bos-eos-chunked-65536, and both lead to the same error.

Here is the version I use:

datasets==2.18.0
>>> import datasets; datasets.load_dataset("./yarn/orca-math-word-problems-200k")
Generating train split:   0%|          | 0/200035 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/apdcephfs_nj4/share_300340418/jongjyh/yarn/lib/python3.8/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
    for _, table in generator:
  File "/apdcephfs_nj4/share_300340418/jongjyh/yarn/lib/python3.8/site-packages/datasets/packaged_modules/parquet/parquet.py", line 85, in _generate_tables
    parquet_file = pq.ParquetFile(f)
  File "/apdcephfs_nj4/share_300340418/jongjyh/yarn/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 318, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1470, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Any solution would be appreciated!

cc @lhoestq

Hi! Have you installed Git LFS prior to pulling? Maybe the files in your git repo are just Git LFS pointers?
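
A quick way to check is to look at the first bytes of one of the files: a real parquet file starts with the PAR1 magic bytes, while an LFS pointer is a small text stub. A minimal sketch (the file path is hypothetical, adjust it to your clone):

path = "./yarn/orca-math-word-problems-200k/data/train-00000-of-00002.parquet"  # hypothetical path

with open(path, "rb") as f:
    head = f.read(64)

if head.startswith(b"PAR1"):
    print("real parquet file (PAR1 magic bytes present)")
elif head.startswith(b"version https://git-lfs"):
    print("Git LFS pointer stub: run `git lfs pull` inside the repo")
else:
    print("neither parquet nor an LFS pointer:", head[:32])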

Also I’d recommend using the huggingface-cli to download repos: https://huggingface.co/docs/huggingface_hub/v0.21.4/en/guides/cli#download-an-entire-repository
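
If you prefer to stay in Python, huggingface_hub.snapshot_download should do the same thing as the CLI (a sketch; the local_dir value is just an example):

from huggingface_hub import snapshot_download

# Downloads the whole dataset repo over HTTP, no Git/LFS involved.
snapshot_download(
    repo_id="microsoft/orca-math-word-problems-200k",
    repo_type="dataset",
    local_dir="./orca-math",  # example path
)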

Thanks for your kind reply! @lhoestq @severo

I fixed this by using another server to download the dataset, as I found the downloaded dataset had been corrupted by an incompatible OpenSSL.

I still tried the solution provided by @lhoestq for convenience, but I found something strange:

  1. the default cache_dir for huggingface-cli is ~/.cache/huggingface/hub
  2. the default cache_dir for datasets is ~/.cache/huggingface/datasets
  3. datasets.load_dataset uses the second cache_dir, which is different from the first (see the snippet below)
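
For reference, you can print both defaults to see the mismatch (a sketch; the constant names are my assumption based on the huggingface_hub and datasets versions I checked):

import datasets.config
from huggingface_hub import constants

# default cache used by huggingface-cli / huggingface_hub downloads
print(constants.HF_HUB_CACHE)             # ~/.cache/huggingface/hub
# default cache used by datasets.load_dataset
print(datasets.config.HF_DATASETS_CACHE)  # ~/.cache/huggingface/datasets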

I used a few datasets to check whether this is the case. I assume using huggingface-cli to download datasets may still be a little tricky right now. I’d appreciate it if you could check whether this is the case. @lhoestq :blush:

Additionally, I find downloading datasets with huggingface-cli to be a safer way.

I avoided the problem I mentioned at the beginning, without introducing another server, by following the solution below:

  1. use huggingface-cli to download the dataset to a local dir. For example:
huggingface-cli download microsoft/orca-math-word-problems-200k --repo-type dataset --local-dir ./d1/ --local-dir-use-symlinks False
  2. load the dataset from the local dir:
import datasets; datasets.load_dataset("./d1")
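
To double-check that the downloaded copy is not corrupted the way my git clone was, here is a quick sanity check you can run before loading (a sketch; pyarrow is already installed as a datasets dependency, and the glob pattern assumes the parquet files sit somewhere under ./d1):

import glob
import pyarrow.parquet as pq

# A corrupted file or an LFS pointer stub raises ArrowInvalid here,
# the same error load_dataset hit at the start of this thread.
for path in glob.glob("./d1/**/*.parquet", recursive=True):
    print(path, pq.ParquetFile(path).metadata.num_rows, "rows")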

It works in my case.
