Hi there,
I am trying to create a new dataset: mehdie/sefaria · Datasets at Hugging Face
When I try to use it to train a tokenizer, the data itself (the data directory) does not get downloaded: I get 0 files and 0 records. Did I do something wrong? I have a custom _split_generators function; maybe something needs to be done there?
Thanks
Tomer
Hi! In your script you seem to use glob, but it's not necessary.
You can use dl_manager.download() to download a list of parquet files, and it will return the list of downloaded parquet file paths.
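For example, here is a minimal sketch of what that could look like. The class name, base URL, and file list are placeholders (the file name is just the one mentioned later in this thread), not your actual script:

```python
import datasets
import pyarrow.parquet as pq

# Placeholder values for illustration; replace with your real file list.
_BASE_URL = "https://huggingface.co/datasets/mehdie/sefaria/resolve/main/data/"
_FILES = ["Chasidut_english.parquet"]

class Sefaria(datasets.GeneratorBasedBuilder):
    def _info(self):
        # Features left out here; they can be inferred or declared explicitly.
        return datasets.DatasetInfo()

    def _split_generators(self, dl_manager):
        # download() accepts a list of URLs and returns the local cached
        # paths in the same order, so no glob is needed.
        urls = [_BASE_URL + name for name in _FILES]
        local_paths = dl_manager.download(urls)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepaths": local_paths},
            )
        ]

    def _generate_examples(self, filepaths):
        key = 0
        for path in filepaths:
            # Read each downloaded parquet file and yield its rows.
            table = pq.read_table(path)
            for row in table.to_pylist():
                yield key, row
                key += 1
```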
Hi,
The dl_manager.download() function returns a list of text files that are around 132 bytes and contain things like this:
version https://git-lfs.github.com/spec/v1
oid sha256:08f7e86610f17d1addd6999e39c969aea926902d0cf616e8d5bd90f41ba124d1
size 7518478
Interestingly, in the same directory each of these files is accompanied by a file with the same name and a .json extension that contains the URL of one of my missing files:
{"url": "https://huggingface.co/datasets/mehdie/sefaria/raw/main/data/Chasidut_english.parquet", "etag": null}
Your URLs were not pointing to the files themselves, but to their Git LFS pointer (metadata) files. You should use
https://huggingface.co/datasets/mehdie/sefaria/resolve/main/data/Chasidut_english.parquet
instead of
https://huggingface.co/datasets/mehdie/sefaria/raw/main/data/Chasidut_english.parquet
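In other words, a quick fix in the script is to build the URL list with resolve instead of raw. A small sketch (the file name is just the one from this thread; extend the list with the rest of your data files):

```python
# "resolve" serves the actual parquet bytes; "raw" serves the
# Git LFS pointer text you saw in the 132-byte files above.
base = "https://huggingface.co/datasets/mehdie/sefaria/resolve/main/data/"
files = ["Chasidut_english.parquet"]  # placeholder: add your other files
urls = [base + name for name in files]
```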