I followed the documentation here and set the environment variable like this:
import os
os.environ["HF_DATASETS_CACHE"] = os.path.join(os.getcwd(), "cache")
but when I load a custom dataset from disk:
from datasets import load_from_disk
my_dataset = load_from_disk("my_dataset")
train_dataset = my_dataset["train"]
the cache directory reported by train_dataset.cache_files still seems to point to the directory of my_dataset (see code here; self._get_cache_file_path then calls self.cache_files).
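One way to check where the cache files end up is sketched here. It is only a minimal sketch: it assumes each entry of train_dataset.cache_files is a dict with a "filename" key, and the helper names (cache_dirs, is_under) and the example path are hypothetical stand-ins, not part of the datasets library.

```python
import os


def cache_dirs(cache_files):
    """Return the distinct directories referenced by a cache_files list.

    Assumption: every entry is a dict with a "filename" key.
    """
    return sorted(
        {os.path.dirname(os.path.abspath(entry["filename"])) for entry in cache_files}
    )


def is_under(path, root):
    """Return True if `path` is `root` itself or lies inside it."""
    path, root = os.path.abspath(path), os.path.abspath(root)
    return os.path.commonpath([path, root]) == root


# Hypothetical stand-in for train_dataset.cache_files after load_from_disk("my_dataset").
example_cache_files = [
    {"filename": os.path.join("my_dataset", "train", "dataset.arrow")}
]

# The directory HF_DATASETS_CACHE was set to in the snippet at the top.
expected_cache = os.path.join(os.getcwd(), "cache")

for directory in cache_dirs(example_cache_files):
    print(directory, "is under HF_DATASETS_CACHE:", is_under(directory, expected_cache))
```

With a real dataset, train_dataset.cache_files would be passed in place of example_cache_files.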
Is this intended behavior? Is there any way to cache all intermediate results under HF_DATASETS_CACHE?
Also, cache_dir is not supported in load_from_disk.