You can do:
from datasets import load_dataset_builder
builder = load_dataset_builder("SLPL/naab-raw")
builder.download_and_prepare(output_dir)
where output_dir may be a local path or a location in cloud storage, for example “s3://…”.
However, the builder works by first downloading the files locally and then preparing the dataset into output_dir, so cache_dir must be a local directory.
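To make the split between the two directories concrete, here is a minimal sketch. It assumes a `datasets` version where `load_dataset_builder` accepts `cache_dir` and `download_and_prepare` accepts `output_dir` and `storage_options`. The helper names `is_remote_path` and `prepare_to_output_dir` are hypothetical, as are the example paths, and the "://" check is only a rough heuristic for spotting a remote URL.

```python
from typing import Optional


def is_remote_path(path: str) -> bool:
    """Rough heuristic (hypothetical helper): URL-style paths such as
    "s3://bucket/dir" contain "://", plain local paths do not."""
    return "://" in path


def prepare_to_output_dir(
    dataset_name: str,
    cache_dir: str,
    output_dir: str,
    storage_options: Optional[dict] = None,
) -> None:
    # Files are downloaded into cache_dir first, so it has to be local.
    if is_remote_path(cache_dir):
        raise ValueError(f"cache_dir must be a local directory, got {cache_dir!r}")

    # Imported here so the check above works even without the library installed.
    from datasets import load_dataset_builder

    builder = load_dataset_builder(dataset_name, cache_dir=cache_dir)
    # output_dir may be local or remote. storage_options is assumed to be an
    # fsspec-style dict holding whatever credentials the remote store needs.
    builder.download_and_prepare(output_dir, storage_options=storage_options)


# Example call (paths are placeholders, not run here):
# prepare_to_output_dir("SLPL/naab-raw", cache_dir="/tmp/hf_cache", output_dir="/data/naab-raw")
```

The check only covers the regular, non-distributed case; pass your own local paths and credentials as needed.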
PS: the one exception is Beam datasets. Since their processing is distributed across several nodes, cache_dir may be in cloud storage in that case.