Cache size much larger than `hf cache scan` shows

Hi, I was trying to download three languages of MLCommons/ml_spoken_words. When I ran out of space, I deleted some things and tried with just two of the languages. I deleted more things, moved my OS to a new, much larger partition, and kept trying to download the datasets. When I run `hf cache scan`, it shows:

MLCommons/ml_spoken_words dataset          60.2G

On the other hand, when I run ncdu, it shows 217 GB for ~/.cache/huggingface: 55 GB in hub/ and 153 GB in datasets/.

Meanwhile, load_dataset has a progress bar that has shown 100% for a while, but the cache keeps growing and nethogs shows that it’s using as much bandwidth as it can. I know I can manually delete the cache and try again, but

a) Do I have a large amount of corrupt cache from the interrupted downloads (and how do I clear it without losing the good downloads), or does Hugging Face use extra disk space (e.g., to block out space for transforms)?

b) Why do I have so much in ~/.cache/huggingface/hub when I’ve only ever downloaded a dataset, not a model?

c) What is load_dataset doing that takes so long after it reaches 100%? Is this a progress-bar issue, or potentially a problem caused by the interrupted downloads?


oh…


Diagnosis: your numbers make sense. hf cache scan only reports the Hub cache; ncdu sums Hub plus the Datasets Arrow/processed cache plus temporary download caches. 55 GB in hub/ + ~153 GB in datasets/ ≈ your 217 GB. This is normal for large audio datasets with transforms and repeated variant loads. (Hugging Face)
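You can reproduce both numbers from Python. A minimal sketch, assuming the default cache locations (scan_cache_dir() is huggingface_hub’s Hub-cache scanner; the datasets/ total is a plain du-style walk):

from pathlib import Path
from huggingface_hub import scan_cache_dir

# what `hf cache scan` reports: the Hub cache only
hub = scan_cache_dir()
print(f"Hub cache:      {hub.size_on_disk / 1e9:.1f} GB")

# what it does NOT report: the Datasets Arrow/processed cache
ds_dir = Path.home() / ".cache" / "huggingface" / "datasets"
ds_size = sum(f.stat().st_size for f in ds_dir.rglob("*") if f.is_file())
print(f"Datasets cache: {ds_size / 1e9:.1f} GB")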

a) Corrupt cache vs “expected extra space”

Mostly expected extra space, not corruption.

  • Two caches exist (see the path-printing sketch after this list):

    • Hub cache at ~/.cache/huggingface/hub holds files fetched from the Hub (models and datasets), organized into blobs/ and snapshots/, plus transient chunk-cache/ and shard-cache/ directories. The transient directories can be large during big transfers and are safe to delete. (Hugging Face)
    • Datasets cache at ~/.cache/huggingface/datasets holds Arrow shards, downloads/, and downloads/extracted/ produced by datasets. That processed data is not counted by hf cache scan. (Hugging Face)
  • Interrupted runs may leave partials, but the usual bloat is from Arrow shards and extracted archives created during prepare steps. You can clean surgically:

    • Remove just this dataset from the Hub cache with the CLI, then re-download if needed. (Hugging Face)
    • Remove processed Arrow files for this dataset only via Dataset.cleanup_cache_files(). (Hugging Face)
  • If you changed the language list, datasets treats it as a different configuration/fingerprint. That creates new processed caches even when raw files are reused. (Hugging Face)
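As noted in the list above, the two caches live in separate trees; both libraries expose their configured paths, so you can print where each one lives:

from huggingface_hub.constants import HF_HUB_CACHE
from datasets import config as ds_config

print("Hub cache:     ", HF_HUB_CACHE)                 # what `hf cache scan` covers
print("Datasets cache:", ds_config.HF_DATASETS_CACHE)  # Arrow shards, downloads/, extracted/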

b) Why ~/.cache/huggingface/hub is big though you “only used datasets”

Because datasets are also stored under the Hub cache. The Hub cache is shared by all HF libraries and holds dataset repos beside model repos. load_dataset(...) fetches scripts and often the raw data from the Hub, which grows ~/.cache/huggingface/hub. (Hugging Face)
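You can see this in the layout itself: repos are cached as models--org--name and datasets--org--name folders side by side. A quick listing, assuming the default cache path:

from pathlib import Path
from huggingface_hub.constants import HF_HUB_CACHE

# dataset repos live in the same tree that `hf cache scan` reports on
for repo in sorted(Path(HF_HUB_CACHE).glob("datasets--*")):
    print(repo.name)  # e.g. datasets--MLCommons--ml_spoken_words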

c) Why the bar sits at 100% while disk and network still grow

The tqdm bar usually tracks the download phase only. After 100% it may still be:

  • Extracting archives and verifying content.
  • Converting to Arrow and writing shards to ~/.cache/huggingface/datasets/....
  • Building indices and split files.
These steps can take a long time for audio and can make the bar look stuck at 100%. Users often report “100% but still running” during filtering/mapping or when using multiple processes. (Hugging Face) To see which phase is actually running, raise the datasets log level, as in the sketch below.
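A minimal sketch of that logging switch, assuming the same two-language call used elsewhere in this thread:

import datasets

# surface the extract/convert/write messages the progress bar hides
datasets.logging.set_verbosity_info()

ds = datasets.load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt"])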

What to do now (surgical, keeps good files)

  1. Inspect Hub usage only
# Hub-only report; excludes ~/.cache/huggingface/datasets
hf cache scan
hf cache scan --dir ~/.cache/huggingface/hub

(Hugging Face)

  2. Prune only MLCommons/ml_spoken_words from the Hub cache
# remove this dataset’s Hub entries only (exact flags vary by
# huggingface_hub version; check `hf cache delete --help`)
hf cache delete dataset/MLCommons/ml_spoken_words -y
# or, with the older CLI, pick the repo interactively:
huggingface-cli delete-cache

(Hugging Face)
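If the CLI flags don’t match your installed version, the same surgical deletion is available programmatically through huggingface_hub’s cache API; a sketch:

from huggingface_hub import scan_cache_dir

cache = scan_cache_dir()
repo = next(r for r in cache.repos
            if r.repo_id == "MLCommons/ml_spoken_words" and r.repo_type == "dataset")

# delete every cached revision of this one dataset repo
strategy = cache.delete_revisions(*(rev.commit_hash for rev in repo.revisions))
print("Will free:", strategy.expected_freed_size_str)
strategy.execute()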

  3. Prune processed Arrow for this dataset
from datasets import load_dataset
ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt"])
print(ds.cleanup_cache_files())  # DatasetDict: returns {split: num_files_removed}

(Hugging Face)

  4. Pin caches to the large partition so future cleanup is trivial
# root of all HF caches; the Hub cache moves to $HF_HOME/hub
# (incl. the transient chunk-cache/ and shard-cache/ directories)
export HF_HOME="/data/hf"

# Datasets Arrow cache (defaults to $HF_HOME/datasets)
export HF_DATASETS_CACHE="/data/hf/datasets"

(Hugging Face)
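Note: export these before Python starts (e.g., in your shell profile); both libraries read the variables at import time, so setting them mid-session has no effect.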

  5. If a run was interrupted and you want a clean fetch for just this dataset
from datasets import load_dataset, DownloadMode
ds = load_dataset("MLCommons/ml_spoken_words",
                  languages=["ar", "tt"],
                  download_mode=DownloadMode.FORCE_REDOWNLOAD)

(Hugging Face)

  6. If disk is tight, stream instead of caching to disk
from datasets import load_dataset
ds = load_dataset("MLCommons/ml_spoken_words",
                  languages=["ar","tt"],
                  streaming=True, split="train")
first = next(iter(ds))

This avoids Arrow writes and large datasets/ growth. (Hugging Face)

  7. Keep the language list stable
    Changing languages=[...] changes the config name (e.g., "ar+tt"), which triggers a new processed cache. Decide on the set once before a full prepare; the sketch below shows how to preview the config name. (GitHub)
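A minimal check using load_dataset_builder, which resolves the builder without preparing the data (on newer datasets versions this script-based dataset may also need trust_remote_code=True):

from datasets import load_dataset_builder

builder = load_dataset_builder("MLCommons/ml_spoken_words", languages=["ar", "tt"])
print(builder.config.name)  # e.g. "ar+tt"; this name becomes part of the processed-cache path
print(builder.cache_dir)    # where the Arrow files for this exact config will live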

Why your 60.2 GB vs 217 GB mismatch is expected

  • hf cache scan summarizes Hub only. It excludes ~/.cache/huggingface/datasets where Arrow and extracted files live. (GitHub)
  • The datasets cache stores Arrow shards plus downloads/ and downloads/extracted/, which together can exceed the raw download size. (Hugging Face)
  • During active transfers you also have chunk-cache/ and shard-cache/ inside the Hub cache. These raise the live total and are safe to remove when done. (Hugging Face)

Short curated references

Official docs

  • Hub cache layout, CLI, and transient chunk-cache/shard-cache. Explains Hub vs Datasets and safe deletions. (Hugging Face)
  • Datasets cache guide. Explains Arrow shards, downloads/, extracted/, cleanup, and moving the cache. (Hugging Face)
  • Streaming mode. Use datasets without local disk growth. (Hugging Face)

Issues and forum threads

  • scan-cache doesn’t include the Datasets cache. Confirms the undercount. (GitHub)
  • Progress shows 100% but work continues during map/filter or multithreaded steps. (Hugging Face Forums)

Dataset-specific

  • MSWC dataset card and loader usage with multiple languages. (Hugging Face)
  • Implementation shows the config name is the "+".join(languages), so different language sets create different processed caches. (GitHub)

Wait, but load_dataset takes a long time. Are you telling me I need to load a dataset in order to delete its cache?