How to use load_dataset with a dataset downloaded via snapshot_download?

sinodeveloper · August 28, 2023, 6:29pm

I use snapshot_download to download a dataset like this:

from huggingface_hub import snapshot_download
snapshot_download("lvwerra/stack-exchange-paired", repo_type="dataset", resume_download=True, revision="main")

The dataset is cached at ~/.cache/huggingface/hub/datasets--lvwerra--stack-exchange-paired, and that directory contains:
blobs refs snapshots

Questions:
1. When I use snapshot_download to download a model, the model loads automatically without being downloaded again. Why can't a dataset downloaded via snapshot_download be loaded the same way, without a second download?

2. How do I load a dataset that was downloaded via snapshot_download?

Thanks!

mariosasko · August 28, 2023, 11:12pm

We currently use a different caching mechanism. Our plan is to align with huggingface_hub’s in version 3.0 to make the cache easier to inspect, etc.

load_dataset requires data files to be in the datasets cache, so using load_dataset with snapshot_download is not optimal as it saves the data files twice on disk.
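
For anyone who still wants to go the snapshot_download route, here is a minimal sketch under a few assumptions. It assumes the hub cache layout from the question (blobs, refs, snapshots), where refs/<revision> stores a commit hash and snapshots/<hash>/ holds the files. The helper names resolve_snapshot_dir and load_from_hub_cache are hypothetical, not part of either library. snapshot_download also returns this snapshot path directly, so the lookup is only needed when the download happened earlier.

```python
from pathlib import Path


def resolve_snapshot_dir(cache_root, repo_id, revision="main"):
    """Return the snapshot folder of a dataset repo in the huggingface_hub cache.

    Assumed layout: <cache_root>/datasets--<org>--<name>/{blobs,refs,snapshots},
    where refs/<revision> contains a commit hash naming a folder in snapshots/.
    """
    repo_dir = Path(cache_root) / ("datasets--" + repo_id.replace("/", "--"))
    commit_hash = (repo_dir / "refs" / revision).read_text().strip()
    snapshot_dir = repo_dir / "snapshots" / commit_hash
    if not snapshot_dir.is_dir():
        raise FileNotFoundError(f"No snapshot for {revision!r} in {repo_dir}")
    return snapshot_dir


def load_from_hub_cache(repo_id, revision="main"):
    """Hypothetical helper: point load_dataset at the local snapshot folder."""
    from datasets import load_dataset  # imported lazily; requires `datasets`

    cache_root = Path.home() / ".cache" / "huggingface" / "hub"
    local_dir = resolve_snapshot_dir(cache_root, repo_id, revision)
    # load_dataset accepts a local directory; the files are still processed
    # into the datasets cache, so the data ends up on disk twice.
    return load_dataset(str(local_dir))
```

Calling load_from_hub_cache("lvwerra/stack-exchange-paired") should then read from the local files rather than the network. Depending on how the repository is organized, load_dataset may also need data_dir or data_files to pick the right files.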

Why doesn’t simply using load_dataset("lvwerra/stack-exchange-paired") work for you?

sinodeveloper · August 29, 2023, 5:58am

Thank you.
Of course, load_dataset is simpler, but the network is unstable when we use it; snapshot_download is more stable. In China, the Hugging Face servers resolve to Japan, and we cannot access the resources reliably. We have to download resources via snapshot_download inside a try/except and a for loop.
For now we use git clone as an alternative. We are looking forward to version 3.0 of datasets; do you have a release date for the new version?

I found the max_retries property of download_config. With max_retries set to 10 and a try/except, we can download the dataset via load_dataset reliably.
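
A rough sketch of that combination, in case it helps others on unstable connections. The helper names call_with_retries and load_with_retries, the number of outer attempts, and the delay are all assumptions for illustration; only DownloadConfig(max_retries=...) and the download_config argument of load_dataset come from the datasets library.

```python
import time


def call_with_retries(fn, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call fn() until it succeeds or attempts run out, then re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as error:  # network failures surface as several exception types
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)  # brief pause before the next attempt
    raise last_error


def load_with_retries(repo_id, max_retries=10, attempts=5):
    """Hypothetical helper: per-request retries plus an outer try/except loop."""
    from datasets import DownloadConfig, load_dataset  # imported lazily

    # max_retries controls how often a single HTTP request is retried.
    config = DownloadConfig(max_retries=max_retries)
    return call_with_retries(
        lambda: load_dataset(repo_id, download_config=config),
        attempts=attempts,
    )
```

The same call_with_retries wrapper could be put around snapshot_download instead of load_dataset, which matches the try/except and for loop described above.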

mariosasko · August 31, 2023, 7:07pm

For now we use git clone as an alternative. We are looking forward to version 3.0 of datasets; do you have a release date for the new version?

Before the end of this year probably.

sunnyg · July 8, 2024, 2:00am

Which on-disk layout will datasets v3 use (snapshot_download, or the current datasets layout)? And is there an updated timeline for release?
