How to load a cached dataset offline?

Hello, all!

My computer doesn’t have an internet connection, so I have to download the dataset on another computer first and then copy it to my offline computer.

I use the following code snippet to download the wikitext-2-raw-v1 dataset.

from datasets import load_dataset
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

I found that some cached files are in subdirectories of ~/.cache/huggingface/.

In the ~/.cache/huggingface/modules/datasets_modules/datasets/wikitext/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126 dir I can see:
__init__.py, __pycache__, dataset_infos.json, wikitext.json, wikitext.py

In the ~/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126 dir I can see:
LICENSE dataset_info.json wikitext-test.arrow wikitext-train.arrow wikitext-validation.arrow

Do I have to copy all those files to the offline computer? Can I change a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126 to other names?

Or how to change those arrow files to csv files?

Hi! You only need the arrow files, but instead of looking for them in cache, it’s more convenient to save the dataset to disk with save_to_disk and transfer the generated folder to another computer, where you can simply load the dataset with load_from_disk("path/to/folder").

Or how to change those arrow files to csv files?

You can use Dataset's to_csv method for that.

Thanks for helping me!