When I update the loading script in my dataset repo, `load_dataset` still uses the cached local copy of the script. I have tried setting `verification_mode="all_checks"`, but it still uses the out-of-date local loading script.
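As a workaround, forcing a redownload does pick up the new script, although it also re-fetches the data files. A minimal sketch with a placeholder repo id (that `FORCE_REDOWNLOAD` also refreshes the cached loading script is my assumption):

```python
from datasets import DownloadMode, load_dataset

# "username/my_dataset" is a placeholder for the script-based dataset repo.
# FORCE_REDOWNLOAD bypasses the local cache entirely, so the loading script
# is fetched again along with the data files.
ds = load_dataset(
    "username/my_dataset",
    download_mode=DownloadMode.FORCE_REDOWNLOAD,
)
```

This issue seems relevant: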
Issue opened 02 Dec 2023, 09:35 PM UTC:
### Describe the bug
When a dataset is updated on the hub, using `load_dataset` will load the locally cached dataset instead of re-downloading the updated dataset.
### Steps to reproduce the bug
Here is a minimal example script to
1. create an initial dataset and upload
2. download it so it is stored in cache
3. change the dataset and re-upload
4. redownload
```python
import time
from datasets import Dataset, DatasetDict, DownloadMode, load_dataset
username = "YOUR_USERNAME_HERE"
initial = Dataset.from_dict({"foo": [1, 2, 3]})
print(f"Intial {initial['foo']}")
initial_ds = DatasetDict({"train": initial})
initial_ds.push_to_hub("test")
time.sleep(1)
download = load_dataset(f"{username}/test", split="train")
changed = download.map(lambda x: {"foo": x["foo"] + 1})
print(f"Changed {changed['foo']}")
changed.push_to_hub("test")
time.sleep(1)
download_again = load_dataset(f"{username}/test", split="train")
print(f"Download Changed {download_again['foo']}")
# >>> gives the outdated [1,2,3] when it should be the changed [2,3,4]
```
The redownloaded dataset should be the changed dataset, but it is actually the cached initial dataset. Force-redownloading gives the correct dataset:
```python
download_again_force = load_dataset(f"{username}/test", split="train", download_mode=DownloadMode.FORCE_REDOWNLOAD)
print(f"Force Download Changed {download_again_force['foo']}")
# >>> [2,3,4]
```
### Expected behavior
I assumed there would be some sort of hashing that checks for changes in the dataset and re-downloads if the hashes don't match.
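A minimal sketch of the kind of check I had in mind, assuming the commit sha of the cached copy is stored somewhere; the helper name and flow are hypothetical:

```python
from typing import Optional

from datasets import DownloadMode, load_dataset
from huggingface_hub import HfApi

def load_if_changed(repo_id: str, cached_sha: Optional[str], split: str = "train"):
    """Hypothetical helper: force a redownload only when the Hub has a newer commit."""
    remote_sha = HfApi().dataset_info(repo_id).sha  # latest commit sha on the Hub
    mode = (
        DownloadMode.FORCE_REDOWNLOAD
        if remote_sha != cached_sha
        else DownloadMode.REUSE_DATASET_IF_EXISTS
    )
    return load_dataset(repo_id, split=split, download_mode=mode), remote_sha
```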
### Environment info
- `datasets` version: 2.15.0
- Platform: Linux-5.15.0-1028-nvidia-x86_64-with-glibc2.17
- Python version: 3.8.17
- `huggingface_hub` version: 0.19.4
- PyArrow version: 13.0.0
- Pandas version: 2.0.3
- `fsspec` version: 2023.6.0
The issue says that datasets not using a loading script work correctly:
PR: huggingface:retrieve-cached-no-script-datasets → huggingface:main, opened 29 Nov 2023, 04:56 PM UTC:
I drafted the logic to retrieve a no-script dataset from the cache. For example, it can reload datasets that were pushed to the Hub if they exist in the cache.
Example:
```python
>>> Dataset.from_dict({"a": [1, 2]}).push_to_hub("lhoestq/tmp")
>>> load_dataset("lhoestq/tmp")
DatasetDict({
train: Dataset({
features: ['a'],
num_rows: 2
})
})
```
And later, without a connection:
```python
>>> load_dataset("lhoestq/tmp")
Using the latest cached version of the dataset from /Users/quentinlhoest/.cache/huggingface/datasets/lhoestq___tmp/*/*/0b3caccda1725efb (last modified on Wed Nov 29 16:50:27 2023) since it couldn't be found locally at lhoestq/tmp.
DatasetDict({
train: Dataset({
features: ['a'],
num_rows: 2
})
})
```
Fixes https://github.com/huggingface/datasets/issues/3547
## Implementation details (EDITED)
I continued in https://github.com/huggingface/datasets/pull/6493; see the changes there.
TODO:
- [x] tests
- [ ] compatible with https://github.com/huggingface/datasets/pull/6458