Hi there,
I am trying to create a new dataset: mehdie/sefaria · Datasets at Hugging Face
When I try to use it to train a tokenizer, the data itself (the data directory) does not get downloaded: I get 0 files and 0 records. Did I do something wrong? I have a custom _split_generators function; maybe something needs to be done there?
Thanks
Tomer
Hi! In your script you seem to use glob, but it's not necessary.
You can use dl_manager.download() to download a list of parquet files, and it will return the list of downloaded parquet file paths.
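For example, here is a minimal sketch of what that could look like. The class name, base URL, and file list are placeholders (the file name is just the one mentioned later in this thread), not your actual script:

```python
import datasets
import pyarrow.parquet as pq

# Placeholder values for illustration; replace with your real file list.
_BASE_URL = "https://huggingface.co/datasets/mehdie/sefaria/resolve/main/data/"
_FILES = ["Chasidut_english.parquet"]

class Sefaria(datasets.GeneratorBasedBuilder):
    def _info(self):
        # Features left out here; they can be inferred or declared explicitly.
        return datasets.DatasetInfo()

    def _split_generators(self, dl_manager):
        # download() accepts a list of URLs and returns the local cached
        # paths in the same order, so no glob is needed.
        urls = [_BASE_URL + name for name in _FILES]
        local_paths = dl_manager.download(urls)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepaths": local_paths},
            )
        ]

    def _generate_examples(self, filepaths):
        key = 0
        for path in filepaths:
            # Read each downloaded parquet file and yield its rows.
            table = pq.read_table(path)
            for row in table.to_pylist():
                yield key, row
                key += 1
```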
Hi,
The dl_manager.download() function returns a list of text files that are around 132 bytes and contain things like this:
version https://git-lfs.github.com/spec/v1
oid sha256:08f7e86610f17d1addd6999e39c969aea926902d0cf616e8d5bd90f41ba124d1
size 7518478
Interestingly, in the same directory each of these files is accompanied by a file with the same name and a .json extension that contains the URL of one of my missing files:
{"url": "https://huggingface.co/datasets/mehdie/sefaria/raw/main/data/Chasidut_english.parquet", "etag": null}
Your URLs were not pointing to the files themselves, but to their Git LFS pointer (metadata) files. You should use
https://huggingface.co/datasets/mehdie/sefaria/resolve/main/data/Chasidut_english.parquet
instead of
https://huggingface.co/datasets/mehdie/sefaria/raw/main/data/Chasidut_english.parquet
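In other words, a quick fix in the script is to build the URL list with resolve instead of raw. A small sketch (the file name is just the one from this thread; extend the list with the rest of your data files):

```python
# "resolve" serves the actual parquet bytes; "raw" serves the
# Git LFS pointer text you saw in the 132-byte files above.
base = "https://huggingface.co/datasets/mehdie/sefaria/resolve/main/data/"
files = ["Chasidut_english.parquet"]  # placeholder: add your other files
urls = [base + name for name in files]
```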