No way of disabling cache when creating dataset from generator

There is no working way of explicitly disabling the cache, aside from passing cache_dir=tempfile.mkdtemp() as an argument or changing the gen_kwargs.

Short answer: mostly yes, that is correct.


On datasets==4.2.0 you cannot turn off the initial Arrow write for Dataset.from_generator. disable_caching() only stops transform-level reuse. To avoid stale reuse you must change the fingerprint: either vary gen_kwargs (4.2.0) or pass fingerprint= (4.3.0+). To avoid writes at all, use IterableDataset or streaming. (Hugging Face Forums)

Background

  • A map-style Dataset is backed by Arrow files on disk. The library fingerprints dataset state and reuses cache by default. disable_caching() affects .map()/.filter() reloads, not the initial build. (Hugging Face)
  • Maintainers state there is no “always new cache” flag for from_generator; changing gen_kwargs is the suggested workaround in 4.2.x. (Hugging Face Forums)
  • Since 4.3.0, from_generator accepts fingerprint= to control cache identity directly. This does not prevent the Arrow write. (GitHub)
  • IterableDataset and streaming mode load examples lazily and, by design, do not write dataset shards to disk. (Hugging Face)

Good practices to “disable” cache when creating from a generator

1) Decide: stop reuse vs stop writes

  • Stop reuse (fresh build each time): set a unique fingerprint for the dataset instance.
  • Stop writes (no Arrow files at all): switch to IterableDataset or load_dataset(..., streaming=True). (Hugging Face)

2) If you must stay on datasets==4.2.0

Use a file-dependent token inside gen_kwargs so the fingerprint changes whenever the underlying file changes.

# docs: cache & disable_caching scope, from_generator behavior
# https://huggingface.co/docs/datasets/en/cache
# https://discuss.huggingface.co/t/is-from-generator-caching-how-to-stop-it/70013

import os, hashlib, h5py
from datasets import Dataset

def file_sig(path: str, n=1<<20) -> str:
    """Cheap content signature: size, mtime, and SHA256 of head+tail."""
    st = os.stat(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        head = f.read(n)
        f.seek(max(0, st.st_size - n))
        tail = f.read(n)
    h.update(head); h.update(tail)
    return f"{st.st_size}:{int(st.st_mtime)}:{h.hexdigest()}"

def h5_to_hf_dataset_generator(path, tokenizer, **_):  # **_ swallows the extra _source_sig kwarg
    with h5py.File(path, "r") as f:
        for k in f:
            tok = tokenizer.protein_encode(text=f[k].attrs["text"], padding=False, truncation=False)
            yield {"attention_mask": tok["attention_mask"], "input_ids": tok["input_ids"]}

sig = file_sig(path)  # path and tokenizer come from your own setup
ds = Dataset.from_generator(
    h5_to_hf_dataset_generator,
    gen_kwargs={"path": path, "tokenizer": tokenizer, "_source_sig": sig},  # _source_sig only perturbs the fingerprint
)

Rationale: identical gen_kwargs reuse the same cache; changing them forces a rebuild. The core team endorses this workaround for 4.2.x. (Hugging Face Forums)

3) If you can upgrade to datasets>=4.3.0

Pass a deterministic fingerprint tied to the data file. This controls reuse without bloating gen_kwargs.

# release notes and signature reference
# https://github.com/huggingface/datasets/releases/tag/4.3.0
# https://huggingface.co/docs/datasets/main/en/package_reference/main_classes
import os, hashlib, h5py
from datasets import Dataset

def file_fingerprint(path: str) -> str:
    st = os.stat(path)
    return f"{st.st_size}:{int(st.st_mtime)}"

def h5_to_hf_dataset_generator(path, tokenizer):
    with h5py.File(path, "r") as f:
        for k in f:
            tok = tokenizer.protein_encode(text=f[k].attrs["text"], padding=False, truncation=False)
            yield {"attention_mask": tok["attention_mask"], "input_ids": tok["input_ids"]}

fp = file_fingerprint(path)
ds = Dataset.from_generator(
    h5_to_hf_dataset_generator,
    gen_kwargs={"path": path, "tokenizer": tokenizer},
    fingerprint=fp,  # deterministic cache identity tied to the file's size and mtime
)

fingerprint= was added in 4.3.0. It changes cache identity but still writes Arrow. (GitHub)

4) Avoid writes entirely: use streaming or IterableDataset

Two ways that don’t materialize Arrow shards:

# A) Pure generator streaming
# docs: Dataset vs IterableDataset and streaming guide
# https://huggingface.co/docs/datasets/en/about_mapstyle_vs_iterable
# https://huggingface.co/docs/datasets/en/stream
from datasets import IterableDataset
ids = IterableDataset.from_generator(
    h5_to_hf_dataset_generator,
    gen_kwargs={"path": path, "tokenizer": tokenizer},
)
for ex in ids:
    pass  # iterate or map lazily
# B) Use the built-in HDF5 loader in streaming mode (when your HDF5 is tabular)
# docs PR adds HDF5 loader doc; 4.3.0 release improves HDF5 streaming
# https://github.com/huggingface/datasets/pull/7740.patch
# https://github.com/huggingface/datasets/releases/tag/4.3.0
from datasets import load_dataset
ids = load_dataset("hdf5", data_files=path, split="train", streaming=True)
for ex in ids:
    pass

The streaming docs say you “don’t download or cache anything,” and the IterableDataset page notes you “don’t write anything on disk.” (Hugging Face)

5) Sandbox or clean caches when you must use Dataset

  • Per-run scratch cache: set an ephemeral dir.
# docs: HF_DATASETS_CACHE / HF_HOME
# https://huggingface.co/docs/datasets/en/cache
export HF_DATASETS_CACHE="$(mktemp -d)"
  • Per-call scratch cache: Dataset.from_generator(..., cache_dir=tempfile.mkdtemp()).
  • Clean stale shards: dataset.cleanup_cache_files() after heavy transforms (see the sketch below). (Hugging Face)
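
A minimal sketch combining the last two bullets; make_examples is a placeholder generator, and the exact on-disk layout can vary between versions.

# per-call scratch cache plus explicit cleanup (sketch)
# docs: https://huggingface.co/docs/datasets/en/cache
import tempfile
from datasets import Dataset

def make_examples():  # stand-in for your real generator
    yield {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1]}

scratch = tempfile.mkdtemp()
ds = Dataset.from_generator(make_examples, cache_dir=scratch)  # Arrow files land under scratch, not the global cache
ds = ds.map(lambda ex: ex)                                     # transform shards land there too
removed = ds.cleanup_cache_files()                             # deletes shards the dataset no longer uses
print(ds.cache_files, removed)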

6) Know the limits of disable_caching()

Use it to force recomputation of transforms like .map(...), not to stop the initial build or load_dataset write. This is documented and acknowledged by maintainers. (Hugging Face)
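
A toy illustration of that scope, assuming the behavior described above; gen is a throwaway generator used only for the example.

# docs: https://huggingface.co/docs/datasets/en/cache
from datasets import Dataset, disable_caching

disable_caching()  # transforms are recomputed instead of reloaded across runs

def gen():
    yield {"x": 1}

ds = Dataset.from_generator(gen)             # still materializes Arrow files on disk
ds2 = ds.map(lambda ex: {"x": ex["x"] + 1})  # this result is not reused from cache on the next run
print(ds.cache_files)                        # the backing Arrow shards are still listed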

7) Fingerprinting tips

  • Keep gen_kwargs small. Large objects slow default hashing in 4.2.x. Prefer a compact _source_sig. (Hugging Face)
  • Inspect what’s on disk: dataset.cache_files lists the backing Arrow shards (see below). (Hugging Face)
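
For instance, a quick way to see what a ds built above has written (paths are version- and config-dependent):

# docs: https://huggingface.co/docs/datasets/main/en/package_reference/main_classes
import os

for f in ds.cache_files:  # list of dicts, one per backing Arrow shard
    print(f["filename"], os.path.getsize(f["filename"]), "bytes")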

Practical playbook

  • Stay on 4.2.0 and need fresh data per file → add a file-dependent token to gen_kwargs as shown. (Hugging Face Forums)
  • Need deterministic control over reuse → upgrade to 4.3.0 and pass fingerprint=. (GitHub)
  • Need zero disk writes → use IterableDataset or streaming. (Hugging Face)
  • Want isolation only → set HF_DATASETS_CACHE or per-call cache_dir. Clean with cleanup_cache_files(). (Hugging Face)

Small, curated references

  • Maintainer guidance: “no always-new-cache flag; vary gen_kwargs.” (Hugging Face Forums)
  • Cache model and the scope of disable_caching(). (Hugging Face)
  • from_generator signature and fingerprint= (added 4.3.0). (Hugging Face)
  • Release notes: “Add custom fingerprint support to from_generator” and HDF5 streaming improvements. (GitHub)
  • Streaming and IterableDataset do not write shards to disk. (Hugging Face)

Bottom line: You cannot fully “disable cache” for Dataset.from_generator. Choose one: vary the fingerprint to avoid reuse, or switch to streaming/IterableDataset to avoid writes. (Hugging Face Forums)