Create a dataset from a generator

@lhoestq I was wondering where I can find detailed information about how caching works (in particular, what informs the decision to load from the cache versus redoing the processing)?

I am generating datasets for inference on the fly during evaluate and predict calls, and I pass the cache_dir option so that the datasets are generated only on the main process while the other processes load them from the cache (I wrap the generation in Trainer's accelerator.main_process_first() context manager during evaluate calls). However, when evaluating at different points in training I need to generate the dataset again. Am I correct in thinking that I should wipe the cache/generator between eval calls, to ensure the evaluation does not use a previously generated dataset? A minimal sketch of the pattern I mean is below.
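For context, here is roughly what I'm doing (the generator `build_examples` and its `step` argument are placeholders I made up for illustration). My current understanding, which I'd like to confirm, is that `Dataset.from_generator` caches based on a fingerprint that hashes the generator function and its `gen_kwargs`, so varying `gen_kwargs` between eval calls might be an alternative to wiping the cache directory:

```python
from datasets import Dataset

def build_examples(step):
    """Placeholder for my real on-the-fly example builder; the `step`
    argument only exists so that gen_kwargs differ between eval calls."""
    for i in range(8):
        yield {"text": f"example {i} generated at eval step {step}"}

# First eval call: generates the dataset and writes the Arrow cache
# files under ./eval_cache.
ds_step_100 = Dataset.from_generator(
    build_examples, gen_kwargs={"step": 100}, cache_dir="./eval_cache"
)

# Later eval call: if the cache key really is a fingerprint hashing the
# generator function together with its gen_kwargs, then changing `step`
# should force a fresh generation rather than reusing the first cache file.
ds_step_200 = Dataset.from_generator(
    build_examples, gen_kwargs={"step": 200}, cache_dir="./eval_cache"
)
```

Is that the right mental model, or do I still need to clean up the cache directory between calls?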