Hi guys,
Because of network fluctuations, I download datasets with git lfs clone and then use load_dataset(path="path_to_git_repo") to load them.
But I recently ran into a problem (see the error below). It seems that load_dataset(path="remote_name") works fine, while loading from a local path to a cloned dataset repo no longer works. I tested two datasets: 1. microsoft/orca-math-word-problems-200k 2. emozilla/pg_books-tokenized-bos-eos-chunked-65536, and both lead to the same error.
Here is the version I use:
datasets==2.18.0
>>> import datasets; datasets.load_dataset("./yarn/orca-math-word-problems-200k")
Generating train split: 0%| | 0/200035 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/apdcephfs_nj4/share_300340418/jongjyh/yarn/lib/python3.8/site-packages/datasets/builder.py", line 1973, in _prepare_split_single
    for _, table in generator:
  File "/apdcephfs_nj4/share_300340418/jongjyh/yarn/lib/python3.8/site-packages/datasets/packaged_modules/parquet/parquet.py", line 85, in _generate_tables
    parquet_file = pq.ParquetFile(f)
  File "/apdcephfs_nj4/share_300340418/jongjyh/yarn/lib/python3.8/site-packages/pyarrow/parquet/core.py", line 318, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1470, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
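One thing that might be worth checking, as a rough sketch and not a confirmed diagnosis: a valid Parquet file starts and ends with the 4 bytes PAR1, while a Git LFS pointer file (what is left on disk when the real LFS objects were not fetched) is a small text file beginning with "version https://git-lfs.github.com/spec/v1". Whether that is what happened in my clone is only an assumption. The helper names classify_file and scan_repo below are made up for illustration and are not part of datasets or pyarrow.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"
# Parquet files begin and end with these 4 magic bytes.
PARQUET_MAGIC = b"PAR1"


def classify_file(path):
    """Return 'lfs_pointer', 'parquet', or 'unknown' for a single file.

    Hypothetical helper: it only inspects the first and last bytes,
    so it does not prove that a Parquet file is fully readable.
    """
    path = Path(path)
    size = path.stat().st_size
    with path.open("rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
        if head.startswith(LFS_POINTER_PREFIX):
            return "lfs_pointer"
        if size >= 8:
            f.seek(-4, 2)  # 2 = seek relative to the end of the file
            tail = f.read(4)
            if head[:4] == PARQUET_MAGIC and tail == PARQUET_MAGIC:
                return "parquet"
    return "unknown"


def scan_repo(repo_dir):
    """Classify every *.parquet file found under repo_dir (recursively)."""
    return {
        str(p): classify_file(p)
        for p in sorted(Path(repo_dir).rglob("*.parquet"))
    }
```

Calling scan_repo("./yarn/orca-math-word-problems-200k") would list each .parquet file with its classification. If any come back as 'lfs_pointer', running git lfs pull inside the cloned repo should fetch the real files; if they come back as 'unknown', the download may have been truncated.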
Any solution would be appreciated!