I’m trying the Dataset.from_generator() loading function, and it is awesome! However, on some runs it simply ignores my changes to the generator function. How can I stop it from caching?
I tried:
Hmm, can you share more details about your generator? It’s not supposed to skip your changes.
We haven’t yet added an argument to from_generator() that forces a fresh cache every time. Feel free to open an issue on GitHub to request it if you’re interested.
Note that every time you modify the gen_kwargs passed to from_generator, it will use a new cache. You can leverage this as a workaround.
I was also able to observe this. My generator relies on external data written to disk, so reusing the cached result is not what I want. I was able to solve the problem by passing a random value to the generator, which the generator ignores:
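from datasets import Dataset
import random

def generator(ignored):
    # `ignored` is unused; it only exists so that gen_kwargs changes on every run
    # ... do something
    yield {"id": 0}  # placeholder example; replace with your real data

dataset = Dataset.from_generator(generator, gen_kwargs={"ignored": random.random()})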
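If you only want a new cache when the external data actually changes, a variation on the same trick might be to pass a value derived from the file instead of a random number. This is just a sketch and relies on the behaviour described above (changing gen_kwargs gives a new cache). The helper name data_version, the version argument, and the data.txt path are made up for illustration and are not part of the datasets API.

```python
import os


def data_version(path):
    """Return a token that changes when the file at `path` is rewritten.

    Hypothetical helper: combines modification time and size from os.stat.
    """
    stat = os.stat(path)
    return f"{stat.st_mtime_ns}-{stat.st_size}"


def generator(path, version):
    # `version` is not used here; it only exists so that gen_kwargs
    # differs whenever the underlying file has changed.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield {"text": line.rstrip("\n")}


# Assumed usage (requires the `datasets` package; "data.txt" is a placeholder):
# from datasets import Dataset
# dataset = Dataset.from_generator(
#     generator,
#     gen_kwargs={"path": "data.txt", "version": data_version("data.txt")},
# )
```

With this, runs over unchanged data should still hit the cache, while rewriting the file should trigger a rebuild. The random-value version above is simpler if you always want a rebuild.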