I use snapshot_download to download a dataset like this:
from huggingface_hub import snapshot_download
snapshot_download("lvwerra/stack-exchange-paired", repo_type="dataset", resume_download=True, revision="main")
The dataset is cached at ~/.cache/huggingface/hub/datasets--lvwerra--stack-exchange-paired, and the structure of the directory is:
blobs refs snapshots
Questions:
1. When I use snapshot_download to download a model, the model loads automatically afterwards without being redownloaded. Why can't a dataset downloaded via snapshot_download be loaded the same way, without redownloading?
2. How do I load a dataset that was downloaded via snapshot_download?
Thanks!
We currently use a different caching mechanism. Our plan is to align with huggingface_hub's in version 3.0 to make the cache easier to inspect, etc.
load_dataset requires data files to be in the datasets cache, so using load_dataset with snapshot_download is not optimal, as it saves the data files twice on disk.
Why doesn't simply using load_dataset("lvwerra/stack-exchange-paired") work for you?
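For readers who still want to combine the two, here is a minimal sketch, assuming load_dataset is pointed at the local folder that snapshot_download returns. The helper name load_snapshot_as_dataset is made up for illustration. As noted above, this stores the data twice (once in the hub cache, once in the datasets cache), and depending on how the repository is organized, load_dataset may also need an argument such as data_dir to pick the right files.

```python
def load_snapshot_as_dataset(repo_id, revision="main"):
    """Hypothetical helper: download a dataset repo, then load it from disk."""
    # Imported inside the function so the helper can be defined even when
    # the libraries are not installed yet.
    from huggingface_hub import snapshot_download
    from datasets import load_dataset

    # snapshot_download returns the local path of the downloaded snapshot
    # (a folder under .../snapshots/<commit hash>).
    local_dir = snapshot_download(repo_id, repo_type="dataset", revision=revision)

    # load_dataset accepts a local directory; it prepares its own copy of
    # the data in the datasets cache, hence the double disk usage.
    return load_dataset(local_dir)


# Example call (needs network access on the first run):
# ds = load_snapshot_as_dataset("lvwerra/stack-exchange-paired")
```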
Thank you,
Of course, load_dataset is simple, but downloads through load_dataset are unstable on our network, and snapshot_download is more stable. In China, the Hugging Face server resolves to Japan, so we cannot access the resources reliably. We have to download resources via snapshot_download inside a try/except block and a for loop.
For now we use git clone as an alternative. We are looking forward to version 3.0 of datasets. Do you have a release date for the new version?
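The try/except-plus-loop approach mentioned above might look roughly like the sketch below. The helper names retry and download_dataset_snapshot, the attempt count, and the wait time are all assumptions for illustration, not the exact code we run. Files that were already fully downloaded stay in the hub cache, so a repeated call should not fetch them again.

```python
import time


def retry(fn, attempts=5, wait_seconds=5.0):
    """Call fn() until it succeeds or the attempts are used up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as err:  # network errors surface as various types
            last_error = err
            print(f"attempt {attempt}/{attempts} failed: {err}")
            if attempt < attempts:
                time.sleep(wait_seconds)
    raise RuntimeError(f"all {attempts} attempts failed") from last_error


def download_dataset_snapshot(repo_id, attempts=5, wait_seconds=5.0):
    """Hypothetical wrapper: snapshot_download inside the retry loop."""
    from huggingface_hub import snapshot_download

    return retry(
        lambda: snapshot_download(repo_id, repo_type="dataset", revision="main"),
        attempts=attempts,
        wait_seconds=wait_seconds,
    )


# Example call (needs network access):
# path = download_dataset_snapshot("lvwerra/stack-exchange-paired")
```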
I found the max_retries property of download_config. With max_retries set to 10 and a try/except around the call, we can download the dataset via load_dataset reliably.
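A minimal sketch of that setup, assuming DownloadConfig(max_retries=10) is passed to load_dataset through its download_config argument. The function name load_dataset_with_retries and the outer attempt count and wait time are made up for illustration.

```python
import time


def load_dataset_with_retries(repo_id, max_retries=10, attempts=3, wait_seconds=5.0):
    """Hypothetical helper: load_dataset with per-file retries plus an outer loop."""
    from datasets import DownloadConfig, load_dataset

    # max_retries controls how often a single file download is retried.
    config = DownloadConfig(max_retries=max_retries)

    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return load_dataset(repo_id, download_config=config)
        except Exception as err:  # fall back to retrying the whole call
            last_error = err
            print(f"load_dataset attempt {attempt}/{attempts} failed: {err}")
            if attempt < attempts:
                time.sleep(wait_seconds)
    raise RuntimeError(f"load_dataset failed after {attempts} attempts") from last_error


# Example call (needs network access):
# ds = load_dataset_with_retries("lvwerra/stack-exchange-paired")
```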
For now we use git clone as an alternative. We are looking forward to version 3.0 of datasets. Do you have a release date for the new version?
Before the end of this year probably.
Which on-disk layout will datasets v3 use (snapshot_download, or the current datasets layout)? And is there an updated timeline for release?