Map result saved to a different folder than custom HF_DATASETS_CACHE

I followed the documentation here and set the environment variable like this:

import os
os.environ["HF_DATASETS_CACHE"] = os.path.join(os.getcwd(), "cache")

But when I load a custom dataset:

from datasets import load_from_disk
my_dataset = load_from_disk("my_dataset")
train_dataset = my_dataset["train"]

the cache directory reported by train_dataset.cache_files still seems to point to the directory of my_dataset (see code here).

(then self._get_cache_file_path calls self.cache_files)

Is this an intended behavior? Is there any way I can cache all intermediate results to HF_DATASETS_CACHE?

Also, cache_dir is not supported in load_from_disk.

Hi! That’s expected. load_from_disk loads arrow files from a given directory, so train_dataset.cache_files should point to my_dataset/train/dataset.arrow (there is nothing to cache here).

Is this an intended behavior? Is there any way I can cache all intermediate results to HF_DATASETS_CACHE?

Yes, if you save your dataset inside HF_DATASETS_CACHE. This is not advised, though, as it can lead to name collisions: that directory also stores the arrow files generated by load_dataset, and by map when it is called on datasets loaded from there.