When I update the loading script in my dataset repo, `load_dataset` still uses the cached local copy of the script. I have tried setting `verification_mode="all_checks"`, but it still uses the out-of-date local loading script.
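As a workaround, forcing a redownload does pick up the new script, although it also re-fetches the data files. A minimal sketch with a placeholder repo id (that `FORCE_REDOWNLOAD` also refreshes the cached loading script is my assumption):

```python
from datasets import DownloadMode, load_dataset

# "username/my_dataset" is a placeholder for the script-based dataset repo.
# FORCE_REDOWNLOAD bypasses the local cache entirely, so the loading script
# is fetched again along with the data files.
ds = load_dataset(
    "username/my_dataset",
    download_mode=DownloadMode.FORCE_REDOWNLOAD,
)
```

This issue seems relevant: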
Issue opened 02 Dec 2023, 09:35 PM UTC:
### Describe the bug
When a dataset is updated on the hub, using `load_dataset` will load the locally cached dataset instead of re-downloading the updated dataset.
### Steps to reproduce the bug
Here is a minimal example script to
1. create an initial dataset and upload
2. download it so it is stored in cache
3. change the dataset and re-upload
4. redownload
```python
import time
from datasets import Dataset, DatasetDict, DownloadMode, load_dataset
username = "YOUR_USERNAME_HERE"
initial = Dataset.from_dict({"foo": [1, 2, 3]})
print(f"Intial {initial['foo']}")
initial_ds = DatasetDict({"train": initial})
initial_ds.push_to_hub("test")
time.sleep(1)
download = load_dataset(f"{username}/test", split="train")
changed = download.map(lambda x: {"foo": x["foo"] + 1})
print(f"Changed {changed['foo']}")
changed.push_to_hub("test")
time.sleep(1)
download_again = load_dataset(f"{username}/test", split="train")
print(f"Download Changed {download_again['foo']}")
# >>> gives the outdated [1,2,3] when it should be the changed [2,3,4]
```
The redownloaded dataset should be the changed dataset, but it is actually the cached initial dataset. Force-redownloading gives the correct dataset:
```python
download_again_force = load_dataset(f"{username}/test", split="train", download_mode=DownloadMode.FORCE_REDOWNLOAD)
print(f"Force Download Changed {download_again_force['foo']}")
# >>> [2,3,4]
```
### Expected behavior
I assumed there would be some sort of hashing that checks for changes in the dataset and re-downloads if the hashes don't match.
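A minimal sketch of the kind of check I had in mind, assuming the commit sha of the cached copy is stored somewhere; the helper name and flow are hypothetical:

```python
from typing import Optional

from datasets import DownloadMode, load_dataset
from huggingface_hub import HfApi

def load_if_changed(repo_id: str, cached_sha: Optional[str], split: str = "train"):
    """Hypothetical helper: force a redownload only when the Hub has a newer commit."""
    remote_sha = HfApi().dataset_info(repo_id).sha  # latest commit sha on the Hub
    mode = (
        DownloadMode.FORCE_REDOWNLOAD
        if remote_sha != cached_sha
        else DownloadMode.REUSE_DATASET_IF_EXISTS
    )
    return load_dataset(repo_id, split=split, download_mode=mode), remote_sha
```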
### Environment info
- `datasets` version: 2.15.0
- Platform: Linux-5.15.0-1028-nvidia-x86_64-with-glibc2.17
- Python version: 3.8.17
- `huggingface_hub` version: 0.19.4
- PyArrow version: 13.0.0
- Pandas version: 2.0.3
- `fsspec` version: 2023.6.0
The issue says that datasets not using a loading script work correctly:
PR: huggingface:retrieve-cached-no-script-datasets → huggingface:main, opened 29 Nov 2023, 04:56 PM UTC:
I drafted the logic to retrieve a no-script dataset from the cache. For example, it can reload datasets that were pushed to the Hub if they exist in the cache.
Example:
```python
>>> Dataset.from_dict({"a": [1, 2]}).push_to_hub("lhoestq/tmp")
>>> load_dataset("lhoestq/tmp")
DatasetDict({
train: Dataset({
features: ['a'],
num_rows: 2
})
})
```
And later, without a connection:
```python
>>> load_dataset("lhoestq/tmp")
Using the latest cached version of the dataset from /Users/quentinlhoest/.cache/huggingface/datasets/lhoestq___tmp/*/*/0b3caccda1725efb (last modified on Wed Nov 29 16:50:27 2023) since it couldn't be found locally at lhoestq/tmp.
DatasetDict({
train: Dataset({
features: ['a'],
num_rows: 2
})
})
```
Fixes https://github.com/huggingface/datasets/issues/3547
## Implementation details (EDITED)
I continued in https://github.com/huggingface/datasets/pull/6493; see the changes there.
TODO:
- [x] tests
- [ ] compatible with https://github.com/huggingface/datasets/pull/6458