Specifying download directory for custom dataset loading script

I’m following this tutorial for making a custom dataset loading script that is callable through datasets.load_dataset(). In the section about downloading data files and organizing splits, it says that datasets.DatasetBuilder._split_generators() takes a datasets.DownloadManager as input. When loading the dataset, the loader always downloads to /root/.cache/a/randomly/generated/path. Is there a way I can tell the datasets.DownloadManager to download and read from a specific directory through datasets.load_dataset()?


Hi,

By default, the download directory is ~/.cache/huggingface/datasets/downloads. To change the location, either set the HF_DATASETS_DOWNLOADED_DATASETS_PATH env variable to a different value or modify the DOWNLOADED_DATASETS_PATH variable:

from pathlib import Path

import datasets

# target_path is the directory you want the raw downloads to go to
datasets.config.DOWNLOADED_DATASETS_PATH = Path(target_path)
datasets.load_dataset(...)
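For the env variable route, here is a minimal sketch. It assumes the variable is read once, when datasets is first imported, so it has to be set before that import. The helper name set_download_dir and the /tmp/hf_downloads path are placeholders, not part of the library.

```python
import os
from pathlib import Path


def set_download_dir(target_path):
    """Export the download-directory variable (hypothetical helper).

    Call this before the first `import datasets`, since the library is
    assumed to read the variable at import time.
    """
    target = Path(target_path)
    os.environ["HF_DATASETS_DOWNLOADED_DATASETS_PATH"] = str(target)
    return target


# "/tmp/hf_downloads" is a placeholder; use any directory you can write to.
download_dir = set_download_dir("/tmp/hf_downloads")

# Only now import the library and load as usual:
# import datasets
# datasets.load_dataset(...)
```

Setting the variable in the shell before starting Python (export HF_DATASETS_DOWNLOADED_DATASETS_PATH=...) should have the same effect and avoids the import-order concern.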

However, in most cases you want to change not only the download directory but the entire cache directory, so that the old cache is ignored. You can do this with the cache_dir argument of load_dataset (or with the HF_DATASETS_CACHE env variable / datasets.config.HF_DATASETS_CACHE).
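A rough sketch of the cache_dir route, wrapped in a small helper so the cache location is chosen in one place. The name load_into is made up for illustration; the only library call used is load_dataset with its cache_dir argument.

```python
from pathlib import Path


def load_into(cache_root, path, **kwargs):
    """Load a dataset, keeping its cache under cache_root (hypothetical helper)."""
    # Imported here so the helper can be defined without the library loaded yet.
    import datasets

    cache_root = Path(cache_root)
    # Create the directory up front so a bad path fails early.
    cache_root.mkdir(parents=True, exist_ok=True)
    return datasets.load_dataset(path, cache_dir=str(cache_root), **kwargs)
```

You would call it as load_into("/data/hf_cache", "path/to/my_loading_script.py"), where both paths are placeholders. As far as I can tell, the DownloadManager then writes its files to a downloads subfolder of that directory, so the raw downloads and the prepared dataset end up under the same root.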
