How to load a dataset that exists in the cache path

Hi,
I tried this code on a server with an internet connection:

from datasets import load_dataset
wiki = load_dataset("wikipedia", "20200501.en", split="train")

The automatic download then started, and there is now a folder
~/.cache/huggingface/datasets/wikipedia/20200501.en/1.0.0/4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63/ which contains wikipedia-train.arrow and some other files.
Now I’d like to use the dataset on a server without an internet connection.
What should I do?
I tried it with

from datasets import load_from_disk
wiki = load_from_disk("~/.cache/huggingface/datasets/wikipedia/20200501.en/1.0.0/4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63")

It reported that state.json was not found in that folder.
Any advice?

I solved it by pointing load_dataset at the cached dataset script:

wiki = load_dataset("~/.cache/huggingface/modules/datasets_modules/datasets/wikipedia/4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63/wikipedia.py", "20200501.en", split="train")
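
Some context on the state.json error, for anyone landing here: load_from_disk expects a directory written by save_to_disk, and the load_dataset cache folder is not laid out that way. Below is a minimal sketch for telling the two apart. The helper name describe_dataset_dir is hypothetical, and it rests on the assumption (consistent with the error above) that a save_to_disk folder contains state.json while the cache folder only holds .arrow files and metadata.

```python
import os


def describe_dataset_dir(path):
    """Classify a directory by the marker files it contains.

    Assumption: a folder written by `save_to_disk` contains `state.json`,
    while the `load_dataset` cache holds `.arrow` files without it.
    """
    # os.path functions do not expand "~" on their own, so do it explicitly.
    full_path = os.path.expanduser(path)
    if not os.path.isdir(full_path):
        return "missing"
    files = set(os.listdir(full_path))
    if "state.json" in files:
        return "save_to_disk"
    if any(name.endswith(".arrow") for name in files):
        return "load_dataset cache"
    return "unknown"


if __name__ == "__main__":
    cache_dir = (
        "~/.cache/huggingface/datasets/wikipedia/20200501.en/1.0.0/"
        "4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63"
    )
    print(describe_dataset_dir(cache_dir))
```

If the folder is reported as a load_dataset cache, one option is to call wiki.save_to_disk(...) on the machine with internet access and copy that output folder over, which should give load_from_disk the layout it is looking for.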

Hi !

Glad you managed to solve your issue 🙂

In the recently released 1.3.0 version of datasets, you should also be able to reload your dataset with no internet connection with

wiki = load_dataset("wikipedia", "20200501.en", split="train")

Indeed, if there’s no internet connection, this will fall back on the most recent wikipedia dataset that you’ve loaded with load_dataset, and notify you with a warning.

In your case it would look for a processed dataset in “~/.cache/huggingface/datasets/wikipedia/20200501.en”

Let me know if you have other questions !


Hi,
I upgraded to 1.3.0.
And it indeed works. Thanks!


Hi,

I was wondering what happens when the dataset is already cached but the computer has access to the internet and we call load_dataset() as you did. Will it use the cache or download the dataset again?

Thank you

It will reload from the cache 🙂

(Not super important, but there is one bug that will be fixed in the next release: huggingface/datasets issue #3547, "Datasets created with `push_to_hub` can't be accessed in offline mode".)