How to use load_dataset with a dataset downloaded via snapshot_download?

I use snapshot_download to download a dataset like this:

from huggingface_hub import snapshot_download
snapshot_download("lvwerra/stack-exchange-paired", repo_type="dataset", resume_download=True, revision="main")

The dataset is cached at ~/.cache/huggingface/hub/datasets--lvwerra--stack-exchange-paired, and the directory structure is:
blobs refs snapshots

Questions:
1. When I use snapshot_download to download a model, the model then loads automatically without being downloaded again. Why can't a dataset downloaded via snapshot_download be loaded automatically without a second download?

2. How do I load a dataset that was downloaded via snapshot_download?

Thanks!

We currently use a different caching mechanism. Our plan is to align with huggingface_hub's in version 3.0 to make the cache easier to inspect, etc.

load_dataset requires data files to be in the datasets cache, so using load_dataset with snapshot_download is not optimal as it saves the data files twice on disk.

Why doesn't simply using load_dataset("lvwerra/stack-exchange-paired") work for you?
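For readers who still want to combine the two calls, here is a minimal sketch. It assumes load_dataset can read the files directly from the local snapshot folder that snapshot_download returns; the helper name load_from_snapshot is made up for illustration. As noted above, the data then sits on disk twice (once in the hub cache, once in the datasets cache), and depending on the repository layout you may need to pass data_dir or data_files as well.

```python
def load_from_snapshot(repo_id, revision="main"):
    """Download a dataset repo with snapshot_download, then load it from disk.

    Hypothetical helper, not part of either library.
    """
    # Imported lazily so that defining the helper needs no third-party package.
    from huggingface_hub import snapshot_download
    from datasets import load_dataset

    # snapshot_download returns the path of the local snapshot folder.
    local_path = snapshot_download(repo_id, repo_type="dataset", revision=revision)

    # load_dataset accepts a local directory; it copies/converts the data
    # into the datasets cache, which is the duplication mentioned above.
    return load_dataset(local_path)


# Usage (needs network access on the first run):
# dataset = load_from_snapshot("lvwerra/stack-exchange-paired")
```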

Thank you.
Of course, load_dataset is simpler, but downloads through load_dataset are unstable on our network, and snapshot_download is more stable. In China, requests to the Hugging Face servers are routed to Japan, so we cannot access the resources reliably. We have to download with snapshot_download inside a try/except block in a for loop.
For now we use git clone as an alternative. We are looking forward to version 3.0 of datasets; do you have a release date for the new version?
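The try/except-in-a-for-loop approach described above could look roughly like this. It is only a sketch: the names retry and download_snapshot, the number of attempts, and the waiting time are my own choices, not anything prescribed by huggingface_hub.

```python
import time


def retry(fetch, attempts=5, wait_seconds=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:  # network errors vary, so catch broadly
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                sleep(wait_seconds * attempt)
    raise RuntimeError(f"still failing after {attempts} attempts") from last_error


def download_snapshot():
    # Imported lazily so the retry helper has no third-party dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        "lvwerra/stack-exchange-paired", repo_type="dataset", revision="main"
    )


# Usage (needs network access):
# local_path = retry(download_snapshot)
```

Because snapshot_download skips files that are already complete in the cache, each new attempt should only have to fetch what is still missing.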

I found the max_retries property of download_config. With max_retries set to 10 and a try/except around the call, we can download the dataset via load_dataset reliably.
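In case it helps others, this is roughly what that looks like. Treat it as a sketch: the function name load_with_retries and the outer attempt count and wait time are illustrative assumptions; only DownloadConfig(max_retries=...) and the download_config argument of load_dataset come from the datasets library.

```python
import time


def load_with_retries(name, max_retries=10, attempts=3, wait_seconds=5.0):
    """Load a dataset with per-file retries plus an outer try/except loop.

    Hypothetical helper, not part of the datasets library.
    """
    # Imported lazily so that defining the helper needs no third-party package.
    from datasets import DownloadConfig, load_dataset

    # max_retries controls how often a failed file download is retried.
    config = DownloadConfig(max_retries=max_retries)

    last_error = None
    for _ in range(attempts):
        try:
            return load_dataset(name, download_config=config)
        except Exception as error:  # broad on purpose: failures vary by network
            last_error = error
            time.sleep(wait_seconds)
    raise RuntimeError(f"could not load {name!r} after {attempts} attempts") from last_error


# Usage (needs network access):
# dataset = load_with_retries("lvwerra/stack-exchange-paired")
```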

For now we use git clone as an alternative. We are looking forward to version 3.0 of datasets; do you have a release date for the new version?

Before the end of this year probably.