There is no working way of explicitly disabling the cache, aside from passing cache_dir=tempfile.mkdtemp() as an argument or changing the gen_kwargs.
Essentially, yes.
On datasets==4.2.0 you cannot turn off the initial Arrow write for Dataset.from_generator. disable_caching() only stops transform-level reuse. To avoid stale reuse you must change the fingerprint: either vary gen_kwargs (4.2.0) or pass fingerprint= (4.3.0+). To avoid writes at all, use IterableDataset or streaming. (Hugging Face Forums)
Background
- A map-style Dataset is backed by Arrow files on disk. The library fingerprints dataset state and reuses the cache by default. disable_caching() affects .map()/.filter() reloads, not the initial build. (Hugging Face)
- Maintainers state there is no "always new cache" flag for from_generator; changing gen_kwargs is the suggested workaround in 4.2.x. (Hugging Face Forums)
- Since 4.3.0, from_generator accepts fingerprint= to control cache identity directly. This does not prevent the Arrow write. (GitHub)
- IterableDataset and streaming mode load examples lazily and, by design, do not write dataset shards to disk. (Hugging Face)
Good practices to “disable” cache when creating from a generator
1) Decide: stop reuse vs stop writes
- Stop reuse (fresh build each time): set a unique fingerprint for the dataset instance.
- Stop writes (no Arrow files at all): switch to IterableDataset or load_dataset(..., streaming=True). (Hugging Face)
2) If you must stay on datasets==4.2.0
Use a file-dependent token inside gen_kwargs so the fingerprint changes whenever the underlying file changes.
# docs: cache & disable_caching scope, from_generator behavior
# https://huggingface.co/docs/datasets/en/cache
# https://discuss.huggingface.co/t/is-from-generator-caching-how-to-stop-it/70013
import os, hashlib, h5py
from datasets import Dataset
def file_sig(path: str, n=1<<20) -> str:
    """Cheap content signature: size, mtime, and SHA256 of head+tail."""
    st = os.stat(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        head = f.read(n)
        f.seek(max(0, st.st_size - n))
        tail = f.read(n)
    h.update(head); h.update(tail)
    return f"{st.st_size}:{int(st.st_mtime)}:{h.hexdigest()}"

def h5_to_hf_dataset_generator(path, tokenizer, **_):
    # _source_sig arrives here via gen_kwargs and is swallowed by **_;
    # it exists only to change the fingerprint when the file changes.
    with h5py.File(path, "r") as f:
        for k in f:
            tok = tokenizer.protein_encode(text=f[k].attrs["text"], padding=False, truncation=False)
            yield {"attention_mask": tok["attention_mask"], "input_ids": tok["input_ids"]}

sig = file_sig(path)
ds = Dataset.from_generator(
    h5_to_hf_dataset_generator,
    gen_kwargs={"path": path, "tokenizer": tokenizer, "_source_sig": sig},
)
Rationale: identical gen_kwargs produce the same fingerprint, so the cached Arrow file is reused; changing them forces a rebuild. The core team endorses this workaround for 4.2.x. (Hugging Face Forums)
3) If you can upgrade to datasets>=4.3.0
Pass a deterministic fingerprint tied to the data file. This controls reuse without bloating gen_kwargs.
# release notes and signature reference
# https://github.com/huggingface/datasets/releases/tag/4.3.0
# https://huggingface.co/docs/datasets/main/en/package_reference/main_classes
import os, hashlib, h5py
from datasets import Dataset
def file_fingerprint(path: str) -> str:
    st = os.stat(path)
    return f"{st.st_size}:{int(st.st_mtime)}"

def h5_to_hf_dataset_generator(path, tokenizer):
    with h5py.File(path, "r") as f:
        for k in f:
            tok = tokenizer.protein_encode(text=f[k].attrs["text"], padding=False, truncation=False)
            yield {"attention_mask": tok["attention_mask"], "input_ids": tok["input_ids"]}

fp = file_fingerprint(path)
ds = Dataset.from_generator(
    h5_to_hf_dataset_generator,
    gen_kwargs={"path": path, "tokenizer": tokenizer},
    fingerprint=fp,
)
fingerprint= was added in 4.3.0. It changes cache identity but still writes Arrow. (GitHub)
4) Avoid writes entirely: use streaming or IterableDataset
Two ways that don’t materialize Arrow shards:
# A) Pure generator streaming
# docs: Dataset vs IterableDataset and streaming guide
# https://huggingface.co/docs/datasets/en/about_mapstyle_vs_iterable
# https://huggingface.co/docs/datasets/en/stream
from datasets import IterableDataset
ids = IterableDataset.from_generator(lambda: h5_to_hf_dataset_generator(path, tokenizer))
for ex in ids:
    pass  # iterate or map lazily
# B) Use the built-in HDF5 loader in streaming mode (when your HDF5 is tabular)
# docs PR adds HDF5 loader doc; 4.3.0 release improves HDF5 streaming
# https://github.com/huggingface/datasets/pull/7740.patch
# https://github.com/huggingface/datasets/releases/tag/4.3.0
from datasets import load_dataset
ids = load_dataset("hdf5", data_files=path, split="train", streaming=True)
for ex in ids:
    pass
The streaming docs state that you “don’t download or cache anything,” and the IterableDataset page notes that you “don’t write anything on disk.” (Hugging Face)
5) Sandbox or clean caches when you must use Dataset
- Per-run scratch cache: set an ephemeral dir.
# docs: HF_DATASETS_CACHE / HF_HOME
# https://huggingface.co/docs/datasets/en/cache
export HF_DATASETS_CACHE="$(mktemp -d)"
- Per-call scratch cache: Dataset.from_generator(..., cache_dir=tempfile.mkdtemp()).
- Clean stale shards: dataset.cleanup_cache_files() after heavy transforms; see the sketch below. (Hugging Face)
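A minimal sketch of both options, reusing the h5_to_hf_dataset_generator, path, and tokenizer names from the snippets above (assumed to already be defined):
# docs: https://huggingface.co/docs/datasets/en/cache
import tempfile
from datasets import Dataset

# Write the initial Arrow file into a throwaway directory instead of the shared cache
scratch_dir = tempfile.mkdtemp()
ds = Dataset.from_generator(
    h5_to_hf_dataset_generator,
    gen_kwargs={"path": path, "tokenizer": tokenizer},
    cache_dir=scratch_dir,
)

# After heavy transforms, drop cache files that are no longer referenced
ds = ds.map(lambda ex: ex)            # placeholder transform
n_removed = ds.cleanup_cache_files()  # returns the number of files deleted
print(f"removed {n_removed} cache file(s)")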
6) Know the limits of disable_caching()
Use it to force recomputation of transforms like .map(...), not to stop the initial build or load_dataset write. This is documented and acknowledged by maintainers. (Hugging Face)
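A toy sketch of that scope (the data and names here are illustrative, not from the thread):
# docs: https://huggingface.co/docs/datasets/en/cache
from datasets import Dataset, disable_caching

disable_caching()  # transforms below are recomputed instead of reloaded from cache

ds = Dataset.from_dict({"x": [1, 2, 3]})
ds2 = ds.map(lambda ex: {"x": ex["x"] + 1})  # recomputed on every run

# The initial build of a map-style dataset is still materialized to Arrow
ds3 = Dataset.from_generator(lambda: ({"x": i} for i in range(3)))
print(ds3.cache_files)  # still backed by an Arrow file on disk (unless keep_in_memory is used)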
7) Fingerprinting tips
- Keep gen_kwargs small. Large objects slow default hashing in 4.2.x. Prefer a compact _source_sig. (Hugging Face)
- Inspect what's on disk: dataset.cache_files lists the backing Arrow shards (see the snippet below). (Hugging Face)
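For example, on the ds built earlier (cache_files is documented; _fingerprint is an internal attribute, shown only for inspection):
# docs: https://huggingface.co/docs/datasets/main/en/package_reference/main_classes
for entry in ds.cache_files:
    print(entry["filename"])  # path of each backing Arrow shard
print(ds._fingerprint)        # current cache identity (internal, may change between versions)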
Practical playbook
- Stay on 4.2.0 and need fresh data per file → add a file-dependent token to gen_kwargs as shown. (Hugging Face Forums)
- Need deterministic control over reuse → upgrade to 4.3.0 and pass fingerprint=. (GitHub)
- Need zero disk writes → use IterableDataset or streaming. (Hugging Face)
- Want isolation only → set HF_DATASETS_CACHE or per-call cache_dir. Clean with cleanup_cache_files(). (Hugging Face)
Small, curated references
- Maintainer guidance: “no always-new-cache flag; vary gen_kwargs.” (Hugging Face Forums)
- Cache model and the scope of disable_caching(). (Hugging Face)
- from_generator signature and fingerprint= (added in 4.3.0). (Hugging Face)
- Release notes: “Add custom fingerprint support to from_generator” and HDF5 streaming improvements. (GitHub)
- Streaming and IterableDataset do not write shards to disk. (Hugging Face)
Bottom line: You cannot fully “disable cache” for Dataset.from_generator. Choose one: vary the fingerprint to avoid reuse, or switch to streaming/IterableDataset to avoid writes. (Hugging Face Forums)