Diagnosis: your numbers make sense. `hf cache scan` only reports the Hub cache; `ncdu` sums Hub plus the Datasets Arrow/processed cache plus temporary download caches. 55 GB in `hub/` + ~153 GB in `datasets/` ≈ your 217 GB. This is normal for large audio datasets with transforms and repeated variant loads. (Hugging Face)
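To reproduce that breakdown yourself, here is a minimal sketch that sums the two cache roots with the standard library (the paths are the defaults; adjust if you have set `HF_HOME` or `HF_DATASETS_CACHE`):

```python
from pathlib import Path

def dir_size_gb(root: Path) -> float:
    """Sum real file sizes under root; skip symlinks so hub blobs aren't double-counted."""
    return sum(
        f.stat().st_size for f in root.rglob("*") if f.is_file() and not f.is_symlink()
    ) / 1e9

cache = Path.home() / ".cache" / "huggingface"
print("hub/     ", f"{dir_size_gb(cache / 'hub'):.1f} GB")       # what the Hub cache scan reports
print("datasets/", f"{dir_size_gb(cache / 'datasets'):.1f} GB")  # Arrow + downloads/extracted, not scanned
```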
a) Corrupt cache vs “expected extra space”
Mostly expected extra space, not corruption.
- Two caches exist:
  - Hub cache at `~/.cache/huggingface/hub` holds files fetched from the Hub (models and datasets), organized into `blobs/`, `snapshots/`, and transient `chunk-cache/` and `shard-cache/`. Those transient directories can be large during big transfers and are safe to delete. (Hugging Face)
  - Datasets cache at `~/.cache/huggingface/datasets` holds Arrow shards, `downloads/`, and `downloads/extracted/` produced by `datasets`. That processed data is not counted by `hf cache scan`. (Hugging Face)
- Interrupted runs may leave partials, but the usual bloat is from Arrow shards and extracted archives created during prepare steps. You can clean surgically:
  - Remove just this dataset from the Hub cache with the CLI, then re-download if needed. (Hugging Face)
  - Remove processed Arrow files for this dataset only via `Dataset.cleanup_cache_files()`. (Hugging Face)
- If you changed the language list, `datasets` treats it as a different configuration/fingerprint. That creates new processed caches even when raw files are reused; see the sketch after this list. (Hugging Face)
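To see exactly which processed Arrow files a given language combination owns (and therefore what a surgical cleanup would remove), here is a sketch assuming the `["ar", "tt"]` prepare already completed, so nothing new is downloaded:

```python
from datasets import load_dataset

# Reuses the existing cache; no re-download if the prepare finished.
ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt"])

# Each split lists the Arrow files it reads under ~/.cache/huggingface/datasets.
# A different languages=[...] list resolves to a different config directory.
for split, dset in ds.items():
    for f in dset.cache_files:
        print(split, f["filename"])
```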
b) Why `~/.cache/huggingface/hub` is big though you “only used datasets”
Because datasets are also stored under the Hub cache. The Hub cache is shared by all HF libraries and keeps `datasets--*` repo folders beside `models--*` folders. `load_dataset(...)` fetches scripts and often the raw data from the Hub, which grows `~/.cache/huggingface/hub`. (Hugging Face)
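You can see this directly with a rough sketch that lists the repo folders under the Hub cache, where dataset repos are prefixed `datasets--` and model repos `models--` (sizes are approximate; the scan commands further down give the canonical report):

```python
from pathlib import Path

hub = Path.home() / ".cache" / "huggingface" / "hub"

for entry in sorted(hub.iterdir()):
    # Repo folders are named models--<org>--<name> or datasets--<org>--<name>.
    if entry.is_dir() and entry.name.startswith(("models--", "datasets--")):
        size = sum(
            f.stat().st_size for f in entry.rglob("*") if f.is_file() and not f.is_symlink()
        )
        print(f"{size / 1e9:7.2f} GB  {entry.name}")
```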
c) Why the bar sits at 100% while disk and network still grow
The tqdm bar usually tracks the download phase only. After 100% it may still be:
- Extracting archives and verifying content.
- Converting to Arrow and writing shards to `~/.cache/huggingface/datasets/...`.
- Building indices and split files.
These steps can be long for audio and can look like a “stuck at 100%” bar. Users often report “100% but still running” during filtering/mapping or when using multiple processes. (Hugging Face)
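If you want visibility into those post-download phases, one option is to raise the library's log level before loading; the exact messages vary by `datasets` version, but the extraction and Arrow-writing steps are then reported instead of hiding behind a finished-looking bar (a sketch):

```python
import datasets

# Make the prepare steps (extraction, Arrow conversion, split generation) visible as logs.
datasets.logging.set_verbosity_info()

ds = datasets.load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt"])
```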
What to do now (surgical, keeps good files)
- Inspect Hub usage only

```bash
# Hub-only report; excludes ~/.cache/huggingface/datasets
hf cache scan
hf cache scan --dir ~/.cache/huggingface/hub
```

(Hugging Face)
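The same report is available from Python via `huggingface_hub.scan_cache_dir()`, which is handy if you want to sort repos by size before deciding what to prune (a sketch; like the CLI, it covers only the Hub cache):

```python
from huggingface_hub import scan_cache_dir

report = scan_cache_dir()  # scans ~/.cache/huggingface/hub by default
print(f"Hub cache total: {report.size_on_disk / 1e9:.1f} GB")

# Largest repos first; repo_type distinguishes 'dataset' from 'model'.
for repo in sorted(report.repos, key=lambda r: r.size_on_disk, reverse=True):
    print(f"{repo.size_on_disk / 1e9:7.2f} GB  {repo.repo_type}  {repo.repo_id}")
```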
- Prune only `MLCommons/ml_spoken_words` from the Hub cache

```bash
# remove this dataset's Hub entries only
hf cache delete dataset/MLCommons/ml_spoken_words -y
# or, with the older CLI, pick the repo interactively
huggingface-cli delete-cache
```

(Hugging Face)
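A programmatic alternative that avoids the interactive picker: find this dataset's cached revisions with `scan_cache_dir()` and delete just those (a sketch, assuming the repo sits under the default Hub cache path):

```python
from huggingface_hub import scan_cache_dir

report = scan_cache_dir()

# Collect every cached revision of this one dataset repo.
revisions = [
    rev.commit_hash
    for repo in report.repos
    if repo.repo_type == "dataset" and repo.repo_id == "MLCommons/ml_spoken_words"
    for rev in repo.revisions
]

strategy = report.delete_revisions(*revisions)
print(f"Will free {strategy.expected_freed_size_str}")
strategy.execute()  # deletes the blobs/snapshots for those revisions only
```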
- Prune processed Arrow for this dataset

```python
from datasets import load_dataset

ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt"])
print(ds.cleanup_cache_files())  # number of removed cache files per split
```

(Hugging Face)
- Pin caches to the large partition so future cleanup is trivial

```bash
# Hub cache root (affects hub/*, incl. chunk-cache/shard-cache)
export HF_HOME="/data/hf"
# Datasets Arrow cache
export HF_DATASETS_CACHE="/data/hf/datasets"
```

(Hugging Face)
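If you prefer not to rely on environment variables, the `datasets` cache location (Arrow shards plus `downloads/` and `extracted/`) can also be set per call (a sketch, assuming `/data/hf/datasets` exists on the large partition):

```python
from datasets import load_dataset

# Arrow shards and download/extraction scratch for this call land under /data/hf/datasets.
ds = load_dataset(
    "MLCommons/ml_spoken_words",
    languages=["ar", "tt"],
    cache_dir="/data/hf/datasets",
)
```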
- If a run was interrupted and you want a clean fetch for just this dataset

```python
from datasets import load_dataset, DownloadMode

ds = load_dataset(
    "MLCommons/ml_spoken_words",
    languages=["ar", "tt"],
    download_mode=DownloadMode.FORCE_REDOWNLOAD,
)
```

(Hugging Face)
- If disk is tight, stream instead of caching to disk

```python
from datasets import load_dataset

ds = load_dataset(
    "MLCommons/ml_spoken_words",
    languages=["ar", "tt"],
    streaming=True,
    split="train",
)
first = next(iter(ds))
```

This avoids Arrow writes and large `datasets/` growth. (Hugging Face)
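For a quick look at streamed examples without writing anything, the iterable dataset also supports bounded iteration with `.take()` (field names depend on the loader, so this small usage sketch only prints the keys):

```python
# Inspect a few streamed examples; nothing is cached to disk.
for example in ds.take(5):
    print(list(example.keys()))
```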
- Keep the language list stable
Changing `languages=[...]` changes the config name (e.g., `"ar+tt"`), which triggers a new processed cache. Decide the set once before a full prepare; see the sketch just below for checking the name up front. (GitHub)
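To confirm the config name a given language set resolves to before committing to a full prepare, one option is to instantiate only the builder (a sketch; depending on your `datasets` version, a script-based loader may also need `trust_remote_code=True`):

```python
from datasets import load_dataset_builder

# Resolves the loader and builds its config without downloading or preparing the data.
builder = load_dataset_builder("MLCommons/ml_spoken_words", languages=["ar", "tt"])
print(builder.config.name)  # expected to be "ar+tt" given the loader's naming scheme
```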
Why your 60.2 GB vs 217 GB mismatch is expected
- `hf cache scan` summarizes Hub only. It excludes `~/.cache/huggingface/datasets`, where Arrow and extracted files live. (GitHub)
- The datasets cache stores Arrow shards plus `downloads/` and `downloads/extracted/`, which can exceed the raw download size. (Hugging Face)
- During active transfers you also have `chunk-cache/` and `shard-cache/` inside the Hub cache. These raise the live total and are safe to remove when done. (Hugging Face)
Short curated references
Official docs
- Hub cache layout, CLI, and transient `chunk-cache`/`shard-cache`. Explains Hub vs Datasets and safe deletions. (Hugging Face)
- Datasets cache guide. Explains Arrow shards, `downloads/`, `extracted/`, cleanup, and moving the cache. (Hugging Face)
- Streaming mode. Use datasets without local disk growth. (Hugging Face)
Issues and forum threads
- `scan-cache` doesn't include the Datasets cache. Confirms the undercount. (GitHub)
- Progress shows 100% but work continues during map/filter or multithreaded steps. (Hugging Face Forums)
Dataset-specific
- MSWC dataset card and loader usage with multiple languages. (Hugging Face)
- Implementation shows the config name is `"+".join(languages)`, so different language sets create different processed caches. (GitHub)