Hi,
I ran this code on a server with an internet connection:
from datasets import load_dataset
wiki = load_dataset("wikipedia", "20200501.en", split="train")
The download then started automatically, and there is now a folder ~/.cache/huggingface/datasets/wikipedia/20200501.en/1.0.0/4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63/ which contains wikipedia-train.arrow and some other files.
Now I’d like to use the dataset on a server without an internet connection.
What should I do?
I tried it with
from datasets import load_from_disk
wiki = load_from_disk("~/.cache/huggingface/datasets/wikipedia/20200501.en/1.0.0/4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63")
It reported that state.json was not found in that folder.
Any advice?
You can point load_dataset at the cached loading script:
wiki = load_dataset("~/.cache/huggingface/modules/datasets_modules/datasets/wikipedia/4021357e28509391eab2f8300d9b689e7e8f3a877ebb3d354b01577d497ebc63/wikipedia.py", "20200501.en", split="train")
In the recently released 1.3.0 version of datasets, you should also be able to reload your dataset without an internet connection using:
wiki = load_dataset("wikipedia", "20200501.en", split="train")
Indeed, if there’s no internet connection, this falls back on the latest wikipedia dataset that you’ve loaded with load_dataset and notifies you with a warning.
In your case, it would look for a processed dataset in “~/.cache/huggingface/datasets/wikipedia/20200501.en”.
I was wondering what happens when the dataset is already cached but the computer has internet access and we call load_dataset() as you did. Will it load from the cache or download the dataset again?