How to load a dataset that exists in the cache path

zuujhyt · February 18, 2021, 1:23pm

Hi,
I ran this code on a server with an internet connection:

from datasets import load_dataset
wiki = load_dataset("wikipedia", "20200501.en", split="train")

The download then started automatically, and there is now a folder
~/.cache/huggingface/datasets/wikipedia/20200501.en/1.0.0/4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63/ which contains wikipedia-train.arrow and some other files.
Now I’d like to use the dataset on a server without an internet connection.
What should I do?
I tried it with

from datasets import load_from_disk
wiki = load_from_disk("~/.cache/huggingface/datasets/wikipedia/20200501.en/1.0.0/4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63")

It reported that state.json was not found in that folder.
Any advice?
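
Editor's note: load_from_disk expects a folder written by save_to_disk, which is where state.json comes from; the folder above was produced by load_dataset and holds .arrow files instead. The sketch below is a hypothetical, standard-library-only helper (describe_dataset_dir is not part of datasets) that tells the two kinds of folder apart using only the file names mentioned in this thread. Treat it as an illustration under that assumption, not as a check of the library's full on-disk format.

```python
from pathlib import Path


def describe_dataset_dir(path):
    """Classify a folder by the files it contains (hypothetical helper).

    Returns "saved" if state.json is present, "cache" if only .arrow
    files are present, "unknown" for anything else, and "missing" if
    the folder does not exist.
    """
    folder = Path(path).expanduser()  # expand a leading "~" explicitly
    if not folder.is_dir():
        return "missing"
    if (folder / "state.json").is_file():
        return "saved"  # looks like something load_from_disk could read
    if any(folder.glob("*.arrow")):
        return "cache"  # looks like a load_dataset cache folder
    return "unknown"
```

If this returns "cache" for the folder in question, load_dataset is the function to reach for rather than load_from_disk.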

zuujhyt · February 18, 2021, 3:27pm

Solved it by loading the cached dataset script directly:

wiki = load_dataset("~/.cache/huggingface/modules/datasets_modules/datasets/wikipedia/4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63/wikipedia.py", "20200501.en", split="train")

lhoestq · February 19, 2021, 3:04pm

Hi!

Glad you managed to solve your issue.

In the recently released 1.3.0 version of datasets, you should also be able to reload your dataset without an internet connection using

wiki = load_dataset("wikipedia", "20200501.en", split="train")

Indeed if there’s no internet connection, this will fall back on the latest wikipedia dataset that you’ve loaded with load_dataset, and notify you with a warning.

In your case it would look for a processed dataset in “~/.cache/huggingface/datasets/wikipedia/20200501.en”

Let me know if you have other questions !
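
Editor's note: to see which processed copies are actually present under the cache location mentioned above, a small standard-library sketch like the following may help. It assumes the layout shown earlier in this thread (dataset name, then config, then version, then hash, with .arrow files inside); find_cached_builds is a hypothetical helper, not a datasets function, and the real cache layout may differ between library versions.

```python
from pathlib import Path


def find_cached_builds(cache_root, name, config):
    """List <version>/<hash> folders that contain .arrow files.

    Assumes the layout cache_root/name/config/version/hash seen in
    this thread; returns an empty list if nothing matches.
    """
    base = Path(cache_root).expanduser() / name / config
    if not base.is_dir():
        return []
    return sorted(
        p for p in base.glob("*/*")
        if p.is_dir() and any(p.glob("*.arrow"))
    )
```

For the case discussed here, the call would be find_cached_builds("~/.cache/huggingface/datasets", "wikipedia", "20200501.en"); an empty result suggests there is nothing for the offline fallback to pick up on that machine.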

zuujhyt · February 24, 2021, 6:56am

Hi,
I upgraded to 1.3.0, and it indeed works. Thanks!

bengisucam · December 6, 2023, 4:25pm

Hi,

I was wondering what happens when the dataset is already cached but the computer has internet access and we call load_dataset() as you did. Will it use the cache or download the dataset again?

Thank you

lhoestq · December 6, 2023, 4:44pm

It will reload from the cache.

(Not super important, but there is one bug that will be fixed in the next release: "Datasets created with `push_to_hub` can't be accessed in offline mode", Issue #3547 in huggingface/datasets on GitHub.)
