Knowing if a dataset will be loaded from cache beforehand

Is there a way to know whether a given set of load_dataset parameters will result in the dataset being loaded from cache, or downloaded/built using a custom script?

Hi! Yes, you can load the dataset builder and check whether its associated cache directory already exists:

import os

from datasets import load_dataset_builder


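# ds_name_or_path and ds_config_params are the same arguments you would pass to load_dataset
builder = load_dataset_builder(ds_name_or_path, **ds_config_params)
# cache_dir is the directory where the prepared dataset is (or would be) stored
is_cached = os.path.exists(builder.cache_dir)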
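If you need this check in several places, it can be wrapped in a small helper. The following is only a sketch: the function names `cache_dir_is_populated` and `will_load_from_cache` are made up for illustration and are not part of the datasets API. It also treats an empty directory as "not cached", which is an extra assumption on top of the plain existence check. Note that this is a heuristic: load_dataset can still rebuild the dataset even when the directory exists, for example if you pass download_mode="force_redownload".

```python
import os


def cache_dir_is_populated(cache_dir: str) -> bool:
    """Return True if cache_dir is an existing directory with at least one entry."""
    if not os.path.isdir(cache_dir):
        return False
    with os.scandir(cache_dir) as entries:
        return any(entries)


def will_load_from_cache(name_or_path: str, **config_params) -> bool:
    """Hypothetical helper: guess whether load_dataset would reuse the cache."""
    # Imported lazily so cache_dir_is_populated works without datasets installed.
    from datasets import load_dataset_builder

    builder = load_dataset_builder(name_or_path, **config_params)
    return cache_dir_is_populated(builder.cache_dir)
```

You would then call will_load_from_cache(ds_name_or_path, **ds_config_params) with the same arguments you plan to pass to load_dataset.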