Map result saved to a different folder than custom HF_DATASETS_CACHE

luckynozomi · June 10, 2022, 10:05pm

I followed the documentation here to set my environment variable like this:

import os
os.environ["HF_DATASETS_CACHE"] = os.path.join(os.getcwd(), "cache")

but when I load a custom dataset

from datasets import load_from_disk
my_dataset = load_from_disk("my_dataset")
train_dataset = my_dataset["train"]

the caching directory reported by train_dataset.cache_files still seems to point to the directory of my_dataset (see code here)

(then self._get_cache_file_path calls self.cache_files)

Is this an intended behavior? Is there any way I can cache all intermediate results to HF_DATASETS_CACHE?

Also, cache_dir is not supported in load_from_disk.

mariosasko · June 14, 2022, 1:23pm

Hi! That’s expected. load_from_disk loads arrow files from a given directory, so train_dataset.cache_files should point to my_dataset/train/dataset.arrow (there is nothing to cache here).
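
To check where a loaded dataset's Arrow files actually live, something like the following sketch may help. It relies only on the cache_files property discussed above, which returns a list of dicts with a "filename" key; the helper name arrow_dirs is made up for illustration.

```python
import os


def arrow_dirs(dataset):
    """Return the sorted directories holding the dataset's backing Arrow files.

    `dataset` is expected to expose `cache_files`, a list of
    {"filename": path} dicts, as datasets.Dataset does.
    """
    return sorted({os.path.dirname(f["filename"]) for f in dataset.cache_files})


# Usage (assumes the `datasets` library and a dataset saved at "my_dataset"):
# from datasets import load_from_disk
# train_dataset = load_from_disk("my_dataset")["train"]
# print(arrow_dirs(train_dataset))
```

For a dataset loaded with load_from_disk, this should list the directory it was loaded from rather than HF_DATASETS_CACHE.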

luckynozomi wrote: "Is this an intended behavior? Is there any way I can cache all intermediate results to HF_DATASETS_CACHE?"

Yes, you can, if you save your dataset inside the HF_DATASETS_CACHE directory. This is not advised, though, as it can lead to name collisions: that directory already stores the arrow files generated by load_dataset, and by map when it is called on datasets loaded from there.
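
A possible alternative, not suggested in the thread itself: Dataset.map accepts a cache_file_name argument that sets the path where the result is written, so the output of an individual map call can be sent to a directory of your choice instead of next to the loaded files. The sketch below assumes that argument; the helper name map_to_dir and the file name "train-mapped.arrow" are hypothetical.

```python
import os


def map_to_dir(dataset, function, cache_dir, name, **map_kwargs):
    """Run dataset.map and write the resulting Arrow file to cache_dir/name.

    Assumes `dataset.map` accepts `cache_file_name`, as datasets.Dataset.map does.
    """
    os.makedirs(cache_dir, exist_ok=True)  # map does not need to create it for us
    cache_file = os.path.join(cache_dir, name)
    return dataset.map(function, cache_file_name=cache_file, **map_kwargs)


# Usage (assumes the `datasets` library and the train_dataset from the question):
# cache_dir = os.path.join(os.getcwd(), "cache")
# mapped = map_to_dir(train_dataset, lambda ex: ex, cache_dir, "train-mapped.arrow")
# print(mapped.cache_files)
```

Picking a distinct file name per map call is what keeps this from running into the name collisions mentioned above.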
