How to load cached dataset offline?

Hello, all!

My computer doesn’t have an internet connection, so I have to download the dataset on another computer first and then copy it to the offline one.

I use the following code snippet to download the wikitext-2-raw-v1 dataset:

from datasets import load_dataset
datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

I found that some cached files are in subdirectories of ~/.cache/huggingface/.

In the ~/.cache/huggingface/modules/datasets_modules/datasets/wikitext/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126 dir I can see:
__init__.py, __pycache__, dataset_infos.json, wikitext.json, wikitext.py

In the ~/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126 dir I can see:
LICENSE, dataset_info.json, wikitext-test.arrow, wikitext-train.arrow, wikitext-validation.arrow

Do I have to copy all those files to the offline computer? Can I rename the a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126 directory to something else?

Or how to change those arrow files to csv files?

Hi! You only need the arrow files. But instead of looking for them in the cache, it’s more convenient to save the dataset to disk with save_to_disk and transfer the generated folder to the other computer, where you can simply load the dataset with load_from_disk("path/to/folder").

Or how to change those arrow files to csv files?

You can use Dataset’s to_csv method for that.

Thanks for helping me!