Cache size much larger than `hf cache scan` shows

To delete the cache for a specific model or dataset, you can also locate and delete it directly with your OS’s file manager. However, the right folder can be hard to find… :sweat_smile:

Personally, I recommend the HF CLI, as it’s the most reliable method.


No. You do not need to reload the dataset to delete its cache.

How to delete without loading:

  1. Hub cache (the ~60 GB that `hf cache scan` reports)
    Delete the repo from the Hub cache directly. No Python, no dataset load.
```bash
# preview
hf cache ls --filter "repo_id==dataset/MLCommons/ml_spoken_words"
hf cache rm dataset/MLCommons/ml_spoken_words --dry-run
# delete
hf cache rm dataset/MLCommons/ml_spoken_words -y
# if your cache lives elsewhere
hf cache rm dataset/MLCommons/ml_spoken_words -y --cache-dir /path/to/hf/hub
```

This is the supported way to surgically remove a dataset repo from the Hub cache. (Hugging Face)
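If you’d rather script this than shell out, the same removal can be done with only the standard library. This is a minimal sketch, assuming the documented Hub cache layout in which each repo lives in a folder named `<type>--<org>--<name>` (e.g. `datasets--MLCommons--ml_spoken_words`); the default `hub_root` and the function name `remove_hub_repo` are mine, not part of any HF API, so override the path if your cache lives elsewhere.

```python
import shutil
from pathlib import Path

def remove_hub_repo(repo_id: str,
                    hub_root: Path = Path.home() / ".cache/huggingface/hub",
                    repo_type: str = "datasets",
                    dry_run: bool = True) -> int:
    """Delete one repo folder from the Hub cache; return the bytes it occupies."""
    # Hub cache folders are named "<type>--<org>--<name>",
    # e.g. "datasets--MLCommons--ml_spoken_words".
    folder = hub_root / f"{repo_type}--{repo_id.replace('/', '--')}"
    if not folder.is_dir():
        return 0
    freed = sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())
    if not dry_run:
        shutil.rmtree(folder)
    return freed
```

With `dry_run=True` (the default) it only reports the size, like `--dry-run` above; pass `dry_run=False` to actually delete.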

  2. Datasets processed cache (the ~153 GB under ~/.cache/huggingface/datasets)
    You can remove those Arrow/processed files by path. No need to construct a Dataset in Python.
```bash
# find the directories for this dataset
find ~/.cache/huggingface/datasets -maxdepth 3 -type d -iname '*ml_spoken_words*' -print

# common space hogs you can delete safely
rm -rf ~/.cache/huggingface/datasets/downloads            # raw archives
rm -rf ~/.cache/huggingface/datasets/downloads/extracted  # extracted archives

# remove only this dataset's processed shards (after confirming paths via `find`)
rm -rf ~/.cache/huggingface/datasets/*ml_spoken_words*
```

Hugging Face’s Datasets docs and forum posts confirm that processed caches live under ~/.cache/huggingface/datasets and that it is safe to delete downloads/ and dataset-specific folders when you want to reclaim space. (Hugging Face)
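The find/rm steps above can also be done from Python with only the standard library, which lets you see how much space each matching folder holds before deleting it. A sketch, assuming the default cache location for `datasets_root` and a simple case-insensitive substring match (mirroring `-iname` above); `purge_dataset_dirs` is a hypothetical helper name, not an HF API.

```python
import shutil
from pathlib import Path

def purge_dataset_dirs(pattern: str,
                       datasets_root: Path = Path.home() / ".cache/huggingface/datasets",
                       dry_run: bool = True) -> dict[str, int]:
    """Report (and optionally delete) top-level cache dirs whose name matches pattern."""
    sizes: dict[str, int] = {}
    if not datasets_root.is_dir():
        return sizes
    for entry in datasets_root.iterdir():
        if entry.is_dir() and pattern.lower() in entry.name.lower():
            # total on-disk size of everything under this dataset's cache dir
            sizes[entry.name] = sum(p.stat().st_size
                                    for p in entry.rglob("*") if p.is_file())
            if not dry_run:
                shutil.rmtree(entry)
    return sizes
```

Run it once with `dry_run=True` to preview what would go, then again with `dry_run=False` once the names look right.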

Notes and alternatives:

  • The cleanup_cache_files() API requires a Dataset object, which means loading the dataset first. If the load is slow, skip it and delete by path as above; the method exists, but it is optional. (Hugging Face)
  • If you want to avoid a full download-and-prepare entirely, clean the Hub cache with the CLI and then delete the Datasets cache directories by path. This matches the official cache split: Hub cache at ~/.cache/huggingface/hub, Datasets cache at ~/.cache/huggingface/datasets. (Hugging Face)
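To see that split for yourself, here is a small stdlib sketch that tallies disk usage of both cache roots, honoring the documented `HF_HOME`, `HF_HUB_CACHE`, and `HF_DATASETS_CACHE` override variables; the helper name `hf_cache_sizes` is an assumption of this sketch, not an HF API.

```python
import os
from pathlib import Path

def hf_cache_sizes(env=os.environ) -> dict[str, int]:
    """Rough per-cache disk usage in bytes, honoring the standard env overrides."""
    home = Path(env.get("HF_HOME", Path.home() / ".cache/huggingface"))
    roots = {
        "hub": Path(env.get("HF_HUB_CACHE", home / "hub")),
        "datasets": Path(env.get("HF_DATASETS_CACHE", home / "datasets")),
    }

    def du(root: Path) -> int:
        # sum file sizes recursively; 0 if the cache dir doesn't exist yet
        if not root.is_dir():
            return 0
        return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())

    return {name: du(root) for name, root in roots.items()}
```

If the "hub" number roughly matches what `hf cache scan` reports but "datasets" is much larger, the extra space is the processed cache from section 2.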

Summary: use `hf cache rm ...` for the Hub cache, and delete the dataset’s folders under ~/.cache/huggingface/datasets for processed data. No dataset reload required. (Hugging Face)