Specifying download directory for custom dataset loading script

I’m following this tutorial for making a custom dataset loading script that is callable through datasets.load_dataset(). In the section about downloading data files and organizing splits, it says that datasets.DatasetBuilder._split_generators() takes a datasets.DownloadManager as input. When loading the dataset, the loader always downloads to /root/.cache/a/randomly/generated/path. Is there a way I can tell the datasets.DownloadManager to download and read from a specific directory through datasets.load_dataset()?
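
For reference, here is a minimal sketch of the kind of loading script I mean (the class name, URL, and split are placeholders; a full script also defines _info() and _generate_examples()):

import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    def _split_generators(self, dl_manager: datasets.DownloadManager):
        # dl_manager downloads into the library's download cache and
        # returns the local path of the downloaded (and extracted) archive
        data_dir = dl_manager.download_and_extract("https://example.com/data.zip")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_dir": data_dir},
            )
        ]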


Hi,

By default, the download directory is set to ~/.cache/huggingface/datasets/downloads. To change the location, either set the HF_DATASETS_DOWNLOADED_DATASETS_PATH env variable to a different value or modify the DOWNLOADED_DATASETS_PATH config variable:

import datasets
from pathlib import Path

# target_path is the directory you want the files downloaded to
datasets.config.DOWNLOADED_DATASETS_PATH = Path(target_path)
datasets.load_dataset(...)

However, in most cases what you want is not only to change the location of the download directory but to change the entire cache directory so the old cache is ignored. You can control this easily with the cache_dir argument of load_dataset (or with the HF_DATASETS_CACHE env variable / datasets.config.HF_DATASETS_CACHE).
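
For example, a minimal sketch (the dataset name and path are placeholders):

from datasets import load_dataset

# move the entire cache (downloads + prepared Arrow files) for this call
ds = load_dataset("squad", cache_dir="/mnt/data/hf_cache")

The same effect can be had process-wide by exporting HF_DATASETS_CACHE=/mnt/data/hf_cache before running your script.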


I have a question.
How can I change my cache dir to cloud storage like an S3 bucket? Right now I have this code:

from botocore.session import Session
import s3fs
from datasets import load_dataset_builder

# authenticate to S3 via a botocore session (using an AWS profile)
s3_session = Session(profile="hf2S3")
storage_options = {"session": s3_session}
# alternatively: storage_options = {"key": "XXX", "secret": "XXX"}

fs = s3fs.S3FileSystem(**storage_options)
output_dir = "s3://path/to/my/bucket/"

builder = load_dataset_builder("SLPL/naab-raw")
builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")

I adapted the code from this post: Cloud storage, but my dataset is huge (~130 GB) and I wanted to move the cache to S3 too.
I changed the code like this:

builder = load_dataset_builder("SLPL/naab-raw", cache_dir=output_dir)

but I got this error:

AttributeError: 'S3' object has no attribute '__aenter__'. Did you mean: '__delattr__'?

which, according to the searching I’ve done, might be a network issue.
Do you have any solution to this problem?
Thanks in advance 🙂

I know this is an old post but maybe this will still help someone.

Most libraries don’t understand S3 file paths, so they have no way to interpret or use an “s3://…” URI.
You can try something like

from smart_open import open

which overrides the standard open() with smart_open’s version, but if that doesn’t work you will need to write locally.
Also note that writing to S3 is much slower than writing locally.
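
A minimal sketch of that approach (the bucket path is a placeholder, and it assumes AWS credentials are already configured for boto3):

from smart_open import open  # shadows the builtin open() for URIs like s3://...

# smart_open streams the write straight to the S3 object
with open("s3://my-bucket/some/key.txt", "w") as f:
    f.write("hello from smart_open")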


You can do

from datasets import load_dataset_builder
builder = load_dataset_builder("SLPL/naab-raw")
builder.download_and_prepare(output_dir)

where output_dir may be a local path or a cloud storage location like “s3://…”, for example.

However, the builder works by downloading the raw files locally first and then preparing the dataset in output_dir, so the cache_dir must be a local directory (see the sketch below).
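
For example, a minimal sketch (the bucket path, credentials, and local cache path are placeholders):

from datasets import load_dataset_builder

storage_options = {"key": "XXX", "secret": "XXX"}  # placeholder S3 credentials for s3fs

# the raw files are always downloaded into a local cache_dir first
builder = load_dataset_builder("SLPL/naab-raw", cache_dir="/tmp/hf_cache")

# the prepared dataset is then written to the remote output_dir
builder.download_and_prepare(
    "s3://my-bucket/naab-raw/",
    storage_options=storage_options,
    file_format="parquet",
)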

PS: there is one exception for Beam datasets, since in that case the processing is distributed over several nodes; for Beam datasets, cache_dir may be in cloud storage.


Does the below still hold true for setting the dataset download path? For me, it’s not changing the default download path.

import datasets
from pathlib import Path
datasets.config.DOWNLOADED_DATASETS_PATH = Path("/media/avaish/Ext/datasets")

Hi! DOWNLOADED_DATASETS_PATH is only the directory where the raw data is downloaded. However, the cache directory where the dataset is prepared as Arrow files is controlled by HF_DATASETS_CACHE.
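
A minimal sketch combining the two (the paths and dataset name are placeholders):

import datasets
from pathlib import Path

# where the raw downloads go
datasets.config.DOWNLOADED_DATASETS_PATH = Path("/media/avaish/Ext/datasets/downloads")

# where the prepared Arrow files go (equivalently, pass cache_dir= to load_dataset
# or export the HF_DATASETS_CACHE env variable)
datasets.config.HF_DATASETS_CACHE = Path("/media/avaish/Ext/datasets")

ds = datasets.load_dataset("squad")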