Specifying download directory for custom dataset loading script

You can do

from datasets import load_dataset_builder

output_dir = "path/to/output_dir"  # placeholder: replace with your own destination
builder = load_dataset_builder("SLPL/naab-raw")
builder.download_and_prepare(output_dir)

where output_dir may be a local path or a cloud storage location such as “s3://…”.

However, the builder works by first downloading the files locally and then preparing the dataset into output_dir, so cache_dir must be a local directory.
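
To make the cache_dir / output_dir split concrete, here is a minimal sketch. It assumes a datasets version in which load_dataset_builder accepts cache_dir and download_and_prepare accepts storage_options. The helper names is_remote_uri and prepare_dataset are hypothetical, and the "://" check is only a rough heuristic for spotting cloud URIs.

```python
from typing import Optional


def is_remote_uri(path: str) -> bool:
    """Rough heuristic (an assumption, not a library rule): treat any path
    containing a URL scheme separator, e.g. "s3://", as remote."""
    return "://" in path


def prepare_dataset(repo_id: str, cache_dir: str, output_dir: str,
                    storage_options: Optional[dict] = None):
    """Download raw files into a local cache_dir, then write the prepared
    dataset to output_dir (which may be local or in cloud storage)."""
    if is_remote_uri(cache_dir):
        raise ValueError(f"cache_dir must be a local directory, got {cache_dir!r}")

    # Imported here so the check above can run even without the library installed.
    from datasets import load_dataset_builder

    # cache_dir controls where the source files are downloaded.
    builder = load_dataset_builder(repo_id, cache_dir=cache_dir)
    # storage_options carries filesystem credentials when output_dir is remote.
    builder.download_and_prepare(output_dir, storage_options=storage_options)
    return builder
```

For example, prepare_dataset("SLPL/naab-raw", cache_dir="/data/hf_cache", output_dir="/data/naab-raw") would keep both the downloads and the prepared dataset under /data (both paths are placeholders). Passing a cloud URI as cache_dir fails fast with a ValueError instead of failing partway through the download.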

PS: there is one exception. Beam datasets can use cloud storage for cache_dir, since their processing is distributed across several nodes.
