Create a dataset from a generator

@lhoestq I was wondering where I can find detailed information about how caching works (in particular, what informs the decision to load from the cache versus redoing the processing)?

I am generating datasets for inference on the fly during evaluate and predict calls, and I pass the cache_dir option so that the datasets are generated only on the main process while the other processes load them from the cache (I wrap the generation in Trainer's accelerator.main_process_first() context manager during evaluate calls). However, when evaluating at different points in training I need to generate the dataset again. Am I correct in thinking that I should wipe the cache/generator between eval calls, to ensure the evaluation does not use a previously generated dataset? A minimal sketch of the pattern I mean is below.
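For context, here is roughly what I'm doing (the generator `build_examples` and its `step` argument are placeholders I made up for illustration). My current understanding, which I'd like to confirm, is that `Dataset.from_generator` caches based on a fingerprint that hashes the generator function and its `gen_kwargs`, so varying `gen_kwargs` between eval calls might be an alternative to wiping the cache directory:

```python
from datasets import Dataset

def build_examples(step):
    """Placeholder for my real on-the-fly example builder; the `step`
    argument only exists so that gen_kwargs differ between eval calls."""
    for i in range(8):
        yield {"text": f"example {i} generated at eval step {step}"}

# First eval call: generates the dataset and writes the Arrow cache
# files under ./eval_cache.
ds_step_100 = Dataset.from_generator(
    build_examples, gen_kwargs={"step": 100}, cache_dir="./eval_cache"
)

# Later eval call: if the cache key really is a fingerprint hashing the
# generator function together with its gen_kwargs, then changing `step`
# should force a fresh generation rather than reusing the first cache file.
ds_step_200 = Dataset.from_generator(
    build_examples, gen_kwargs={"step": 200}, cache_dir="./eval_cache"
)
```

Is that the right mental model, or do I still need to clean up the cache directory between calls?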