Cache size much larger than `hf cache scan` shows

To delete the cache for a specific model or dataset, you can also locate and delete it directly with your OS’s file manager. However, the right folder can be hard to find… :sweat_smile:

Personally, I recommend the HF CLI, as it’s the most reliable method.


No. You do not need to reload the dataset to delete its cache.

How to delete without loading:

  1. Hub cache (the ~60 GB that `hf cache scan` reports)
    Delete the repo from the Hub cache directly. No Python, no dataset load.
```bash
# preview
hf cache ls --filter "repo_id==dataset/MLCommons/ml_spoken_words"
hf cache rm dataset/MLCommons/ml_spoken_words --dry-run
# delete
hf cache rm dataset/MLCommons/ml_spoken_words -y
# if your cache lives elsewhere
hf cache rm dataset/MLCommons/ml_spoken_words -y --cache-dir /path/to/hf/hub
```

This is the supported way to surgically remove a dataset repo from the Hub cache. (Hugging Face)
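If you’d rather script this than shell out, the same removal can be done with only the standard library. This is a minimal sketch, assuming the documented Hub cache layout in which each repo lives in a folder named `<type>--<org>--<name>` (e.g. `datasets--MLCommons--ml_spoken_words`); the default `hub_root` and the function name `remove_hub_repo` are mine, not part of any HF API, so override the path if your cache lives elsewhere.

```python
import shutil
from pathlib import Path

def remove_hub_repo(repo_id: str,
                    hub_root: Path = Path.home() / ".cache/huggingface/hub",
                    repo_type: str = "datasets",
                    dry_run: bool = True) -> int:
    """Delete one repo folder from the Hub cache; return the bytes it occupies."""
    # Hub cache folders are named "<type>--<org>--<name>",
    # e.g. "datasets--MLCommons--ml_spoken_words".
    folder = hub_root / f"{repo_type}--{repo_id.replace('/', '--')}"
    if not folder.is_dir():
        return 0
    freed = sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())
    if not dry_run:
        shutil.rmtree(folder)
    return freed
```

With `dry_run=True` (the default) it only reports the size, like `--dry-run` above; pass `dry_run=False` to actually delete.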

  2. Datasets processed cache (the ~153 GB under ~/.cache/huggingface/datasets)
    You can remove those Arrow/processed files by path. No need to construct a Dataset in Python.
```bash
# find the directories for this dataset
find ~/.cache/huggingface/datasets -maxdepth 3 -type d -iname '*ml_spoken_words*' -print

# common space hogs you can delete safely
rm -rf ~/.cache/huggingface/datasets/downloads            # raw archives
rm -rf ~/.cache/huggingface/datasets/downloads/extracted  # extracted archives

# remove only this dataset's processed shards (after confirming paths via `find`)
rm -rf ~/.cache/huggingface/datasets/*ml_spoken_words*
```

Hugging Face’s Datasets docs and forum posts confirm that processed caches live under ~/.cache/huggingface/datasets and that it is safe to delete downloads/ and dataset-specific folders when you want to reclaim space. (Hugging Face)
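The find/rm steps above can also be done from Python with only the standard library, which lets you see how much space each matching folder holds before deleting it. A sketch, assuming the default cache location for `datasets_root` and a simple case-insensitive substring match (mirroring `-iname` above); `purge_dataset_dirs` is a hypothetical helper name, not an HF API.

```python
import shutil
from pathlib import Path

def purge_dataset_dirs(pattern: str,
                       datasets_root: Path = Path.home() / ".cache/huggingface/datasets",
                       dry_run: bool = True) -> dict[str, int]:
    """Report (and optionally delete) top-level cache dirs whose name matches pattern."""
    sizes: dict[str, int] = {}
    if not datasets_root.is_dir():
        return sizes
    for entry in datasets_root.iterdir():
        if entry.is_dir() and pattern.lower() in entry.name.lower():
            # total on-disk size of everything under this dataset's cache dir
            sizes[entry.name] = sum(p.stat().st_size
                                    for p in entry.rglob("*") if p.is_file())
            if not dry_run:
                shutil.rmtree(entry)
    return sizes
```

Run it once with `dry_run=True` to preview what would go, then again with `dry_run=False` once the names look right.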

Notes and alternatives:

  • The cleanup_cache_files() API requires a Dataset object, which means loading the dataset first. If the load is slow, skip it and delete by path as above; the method exists, but it is optional. (Hugging Face)
  • If you want to avoid a full download-and-prepare entirely, clean the Hub cache with the CLI and then delete the Datasets cache directories by path. This matches the official cache split: Hub cache at ~/.cache/huggingface/hub, Datasets cache at ~/.cache/huggingface/datasets. (Hugging Face)
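To see that split for yourself, here is a small stdlib sketch that tallies disk usage of both cache roots, honoring the documented `HF_HOME`, `HF_HUB_CACHE`, and `HF_DATASETS_CACHE` override variables; the helper name `hf_cache_sizes` is an assumption of this sketch, not an HF API.

```python
import os
from pathlib import Path

def hf_cache_sizes(env=os.environ) -> dict[str, int]:
    """Rough per-cache disk usage in bytes, honoring the standard env overrides."""
    home = Path(env.get("HF_HOME", Path.home() / ".cache/huggingface"))
    roots = {
        "hub": Path(env.get("HF_HUB_CACHE", home / "hub")),
        "datasets": Path(env.get("HF_DATASETS_CACHE", home / "datasets")),
    }

    def du(root: Path) -> int:
        # sum file sizes recursively; 0 if the cache dir doesn't exist yet
        if not root.is_dir():
            return 0
        return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())

    return {name: du(root) for name, root in roots.items()}
```

If the "hub" number roughly matches what `hf cache scan` reports but "datasets" is much larger, the extra space is the processed cache from section 2.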

Summary: use `hf cache rm ...` for the Hub cache, and delete the dataset’s folders under ~/.cache/huggingface/datasets for processed data. No dataset reload required. (Hugging Face)