Specifying download directory for custom dataset loading script

lhoestq · November 4, 2022, 3:22pm

You can load the builder and pass the target directory to download_and_prepare:

from datasets import load_dataset_builder
builder = load_dataset_builder("SLPL/naab-raw")
builder.download_and_prepare(output_dir)

where output_dir may be a local path or a cloud storage location such as “s3://…”.

However, the builder first downloads the files locally and only then prepares the dataset into output_dir, so cache_dir must be a local directory.

PS: Beam datasets are the one exception. Because their processing is distributed across several nodes, cache_dir may be in cloud storage in that case.
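
To make the cache_dir / output_dir distinction concrete, here is a minimal sketch that sets both explicitly. The helper names is_local_path and prepare_dataset are hypothetical, as are the example paths in the comment at the bottom. The "://" check is only a rough heuristic for spotting a remote location, and it deliberately ignores the Beam exception mentioned above. It assumes a datasets version where download_and_prepare accepts an output directory, as in the snippet earlier in this post.

```python
def is_local_path(path: str) -> bool:
    """Rough heuristic: anything with a URL scheme (s3://, gs://, ...) is remote."""
    return "://" not in path


def prepare_dataset(repo_id: str, cache_dir: str, output_dir: str) -> None:
    """Download into a local cache_dir, then prepare the dataset into output_dir.

    Hypothetical helper: output_dir may be local or remote, cache_dir may not.
    """
    if not is_local_path(cache_dir):
        raise ValueError(
            f"cache_dir must be a local directory, got {cache_dir!r}"
        )

    # Imported here so the path check above works without datasets installed.
    from datasets import load_dataset_builder

    # cache_dir controls where the raw files are downloaded.
    builder = load_dataset_builder(repo_id, cache_dir=cache_dir)
    # output_dir controls where the prepared dataset is written.
    builder.download_and_prepare(output_dir)


# Example call (paths are placeholders, not taken from the thread):
# prepare_dataset("SLPL/naab-raw", "/data/hf_cache", "/data/naab-raw")
```

Failing early on a remote cache_dir avoids starting a large download that cannot be cached where you asked.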
