Load dataset from cache in offline mode

Suppose I load two files from the c4 dataset and cache them in a particular folder by doing something like:

dataset = load_dataset("allenai/c4", data_files={"train": ["en/c4-train.00000-of-01024.json.gz", "en/c4-train.00001-of-01024.json.gz"]}, split="train", cache_dir="temp_cache")

Then I would like to load the same c4 files from the cache_dir mentioned above in offline mode, i.e. with os.environ["HF_DATASETS_OFFLINE"] = "1" set.

Could you suggest what needs to be done to achieve this?

cc. @albertvillanova @mariosasko @lhoestq

This is not supported right now, though IMO it could be fixed at the same time as issue #3547 in huggingface/datasets ("Datasets created with `push_to_hub` can't be accessed in offline mode").

In the meantime, you can save and reload your dataset using `.save_to_disk()` and `load_from_disk()`.
