Specifying download directory for custom dataset loading script

I’m following this tutorial for making a custom dataset loading script that is callable through datasets.load_dataset(). In the section about downloading data files and organizing splits, it says that datasets.DatasetBuilder._split_generators() takes a datasets.DownloadManager as input. When loading the dataset, the loader always downloads to /root/.cache/a/randomly/generated/path. Is there a way I can tell the datasets.DownloadManager to download and read from a specific directory through datasets.load_dataset()?
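
For reference, here is a minimal sketch of the kind of loading script I mean (the class name, URL, and split are placeholders; a full script also defines _info() and _generate_examples()):

import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    def _split_generators(self, dl_manager: datasets.DownloadManager):
        # dl_manager downloads into the library's download cache and
        # returns the local path of the downloaded (and extracted) archive
        data_dir = dl_manager.download_and_extract("https://example.com/data.zip")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_dir": data_dir},
            )
        ]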


Hi,

By default, the download directory is set to ~/.cache/huggingface/datasets/downloads. To change the location, either set the HF_DATASETS_DOWNLOADED_DATASETS_PATH env variable to a different value or modify the DOWNLOADED_DATASETS_PATH config variable:

import datasets
from pathlib import Path

# target_path is the directory you want the files downloaded to
datasets.config.DOWNLOADED_DATASETS_PATH = Path(target_path)
datasets.load_dataset(...)

However, in most cases what you want is not only to change the location of the download directory but to change the entire cache directory so the old cache is ignored. You can control this easily with the cache_dir argument of load_dataset (or with the HF_DATASETS_CACHE env variable / datasets.config.HF_DATASETS_CACHE).
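
For example, a minimal sketch (the dataset name and path are placeholders):

from datasets import load_dataset

# move the entire cache (downloads + prepared Arrow files) for this call
ds = load_dataset("squad", cache_dir="/mnt/data/hf_cache")

The same effect can be had process-wide by exporting HF_DATASETS_CACHE=/mnt/data/hf_cache before running your script.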


I have a question.
How can I change my cache dir to cloud storage like an S3 bucket? Right now I have this code:

from botocore.session import Session
import s3fs
from datasets import load_dataset_builder

# authenticate to S3 via a botocore session (using an AWS profile)
s3_session = Session(profile="hf2S3")
storage_options = {"session": s3_session}
# alternatively: storage_options = {"key": "XXX", "secret": "XXX"}

fs = s3fs.S3FileSystem(**storage_options)
output_dir = "s3://path/to/my/bucket/"

builder = load_dataset_builder("SLPL/naab-raw")
builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")

I adapted the code from this post: Cloud storage, but my dataset is huge (~130 GB) and I wanted to move the cache to S3 too.
I changed the code like this:

builder = load_dataset_builder("SLPL/naab-raw", cache_dir=output_dir)

but I got this error:

AttributeError: 'S3' object has no attribute '__aenter__'. Did you mean: '__delattr__'?

which, according to the searching I’ve done, might be a network issue.
Do you have any solution to this problem?
Thanks in advance 🙂

I know this is an old post but maybe this will still help someone.

Most libraries don’t understand S3 file paths, so they have no way to interpret or use an “s3://…” URI.
You can try something like

from smart_open import open

which overrides the standard open() with smart_open’s version, but if that doesn’t work you will need to write locally.
Also note that writing to S3 is much slower than writing locally.
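
A minimal sketch of that approach (the bucket path is a placeholder, and it assumes AWS credentials are already configured for boto3):

from smart_open import open  # shadows the builtin open() for URIs like s3://...

# smart_open streams the write straight to the S3 object
with open("s3://my-bucket/some/key.txt", "w") as f:
    f.write("hello from smart_open")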


You can do

from datasets import load_dataset_builder
builder = load_dataset_builder("SLPL/naab-raw")
builder.download_and_prepare(output_dir)

where output_dir may be a local path or a cloud storage location like “s3://…”, for example.

However, the builder works by downloading the raw files locally first and then preparing the dataset in output_dir, so the cache_dir must be a local directory (see the sketch below).
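
For example, a minimal sketch (the bucket path, credentials, and local cache path are placeholders):

from datasets import load_dataset_builder

storage_options = {"key": "XXX", "secret": "XXX"}  # placeholder S3 credentials for s3fs

# the raw files are always downloaded into a local cache_dir first
builder = load_dataset_builder("SLPL/naab-raw", cache_dir="/tmp/hf_cache")

# the prepared dataset is then written to the remote output_dir
builder.download_and_prepare(
    "s3://my-bucket/naab-raw/",
    storage_options=storage_options,
    file_format="parquet",
)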

PS: there is one exception for Beam datasets, since in that case the processing is distributed over several nodes; for Beam datasets, cache_dir may be in cloud storage.


Does the below still hold true for setting the dataset download path? For me, it’s not changing the default download path.

import datasets
from pathlib import Path
datasets.config.DOWNLOADED_DATASETS_PATH = Path("/media/avaish/Ext/datasets")

Hi! DOWNLOADED_DATASETS_PATH is only the directory where the raw data is downloaded. However, the cache directory where the dataset is prepared as Arrow files is controlled by HF_DATASETS_CACHE.
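
A minimal sketch combining the two (the paths and dataset name are placeholders):

import datasets
from pathlib import Path

# where the raw downloads go
datasets.config.DOWNLOADED_DATASETS_PATH = Path("/media/avaish/Ext/datasets/downloads")

# where the prepared Arrow files go (equivalently, pass cache_dir= to load_dataset
# or export the HF_DATASETS_CACHE env variable)
datasets.config.HF_DATASETS_CACHE = Path("/media/avaish/Ext/datasets")

ds = datasets.load_dataset("squad")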