Used Git to download a dataset, but it fails to load

Hi guys,

I tried to download a dataset with git lfs clone (because of network fluctuations) and then used load_dataset(path="path_to_git_repo") to load it.

But I recently ran into a problem (see the error below). It seems that load_dataset(path="remote_name") works fine, while it can no longer load a dataset from a local path to a git repo. I tested two datasets: 1. microsoft/orca-math-word-problems-200k 2. emozilla/pg_books-tokenized-bos-eos-chunked-65536, and both lead to the same error.

Here is the version I use:

datasets==2.18.0
>>> import datasets; datasets.load_dataset("./yarn/orca-math-word-problems-200k")
Generating train split:   0%|          | 0/200035 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/apdcephfs_nj4/share_300340418/jongjyh/yarn/lib/python3.8/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
    for _, table in generator:
  File "/apdcephfs_nj4/share_300340418/jongjyh/yarn/lib/python3.8/site-packages/datasets/packaged_modules/parquet/parquet.py", line 85, in _generate_tables
    parquet_file = pq.ParquetFile(f)
  File "/apdcephfs_nj4/share_300340418/jongjyh/yarn/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 318, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1470, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Any solution would be appreciated!

cc @lhoestq

Hi! Have you installed Git LFS prior to pulling? Maybe the files in your git repo are just Git LFS pointers?
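
A quick way to check is to look at the first bytes of one of the files: a real parquet file starts with the PAR1 magic bytes, while an LFS pointer is a small text stub. A minimal sketch (the file path is hypothetical, adjust it to your clone):

path = "./yarn/orca-math-word-problems-200k/data/train-00000-of-00002.parquet"  # hypothetical path

with open(path, "rb") as f:
    head = f.read(64)

if head.startswith(b"PAR1"):
    print("real parquet file (PAR1 magic bytes present)")
elif head.startswith(b"version https://git-lfs"):
    print("Git LFS pointer stub: run `git lfs pull` inside the repo")
else:
    print("neither parquet nor an LFS pointer:", head[:32])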

Also I’d recommend using the huggingface-cli to download repos: https://huggingface.co/docs/huggingface_hub/v0.21.4/en/guides/cli#download-an-entire-repository
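
If you prefer to stay in Python, huggingface_hub.snapshot_download should do the same thing as the CLI (a sketch; the local_dir value is just an example):

from huggingface_hub import snapshot_download

# Downloads the whole dataset repo over HTTP, no Git/LFS involved.
snapshot_download(
    repo_id="microsoft/orca-math-word-problems-200k",
    repo_type="dataset",
    local_dir="./orca-math",  # example path
)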

Thanks for your kind reply! @lhoestq @severo

I fixed this by using another server to download the dataset, as I found the downloaded dataset had been corrupted by an incompatible OpenSSL.

I still tried the solution provided by @lhoestq for convenience, but I found something strange:

  1. the default cache_dir for huggingface-cli is ~/.cache/huggingface/hub
  2. the default cache_dir for datasets is ~/.cache/huggingface/datasets
  3. datasets.load_dataset uses the second cache_dir, which is different from the first (see the snippet below)
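
For reference, you can print both defaults to see the mismatch (a sketch; the constant names are my assumption based on the huggingface_hub and datasets versions I checked):

import datasets.config
from huggingface_hub import constants

# default cache used by huggingface-cli / huggingface_hub downloads
print(constants.HF_HUB_CACHE)             # ~/.cache/huggingface/hub
# default cache used by datasets.load_dataset
print(datasets.config.HF_DATASETS_CACHE)  # ~/.cache/huggingface/datasets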

I used a few datasets to check whether this is the case. I assume using huggingface-cli to download datasets may still be a little tricky right now. I’d appreciate it if you could check whether this is the case. @lhoestq :blush:

Additionally, I find downloading datasets with huggingface-cli to be a safer way.

I avoided the problem I mentioned at the beginning, without introducing another server, by following the solution below:

  1. use huggingface-cli to download the dataset to a local dir. For example:
huggingface-cli download microsoft/orca-math-word-problems-200k --repo-type dataset --local-dir ./d1/ --local-dir-use-symlinks False
  2. load the dataset from the local dir:
import datasets; datasets.load_dataset("./d1")
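
To double-check that the downloaded copy is not corrupted the way my git clone was, here is a quick sanity check you can run before loading (a sketch; pyarrow is already installed as a datasets dependency, and the glob pattern assumes the parquet files sit somewhere under ./d1):

import glob
import pyarrow.parquet as pq

# A corrupted file or an LFS pointer stub raises ArrowInvalid here,
# the same error load_dataset hit at the start of this thread.
for path in glob.glob("./d1/**/*.parquet", recursive=True):
    print(path, pq.ParquetFile(path).metadata.num_rows, "rows")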

It works in my case.
