Is from_generator() caching? How do I stop it?

I’m trying the Dataset.from_generator() loading function, and it is awesome! However, on some runs it simply ignores my changes to the generator function. How can I stop it from caching?
I tried:

disable_caching()
datasets.set_caching_enabled(False)

Thank you


Hmm, can you share more details about your generator? It’s not supposed to skip your changes.

We haven’t yet added an argument to from_generator() that forces it to always use a new cache. Feel free to open an issue on GitHub to request this if you’re interested.

Note that every time you modify the gen_kwargs passed to from_generator(), it will use a new cache, so you can leverage this as a workaround.


I was able to observe this as well. My generator relies on external data written to disk, so reusing a cached result is not what I want. I solved the problem by passing a random value in gen_kwargs that the generator ignores:

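from datasets import Dataset
import random

def generator(ignored):
    # `ignored` is never used; it only makes gen_kwargs differ between runs
    # ... do something
    yield {"text": "placeholder"}  # placeholder row so this is a real generator

# A fresh random value on each run changes gen_kwargs, so a new cache is used
dataset = Dataset.from_generator(generator, gen_kwargs={"ignored": random.random()})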
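A random value forces a rebuild on every run. If you would rather rebuild only when the external data actually changes, one possible variation is to pass a hash of the file contents instead. This is only a sketch: file_fingerprint, build_dataset, the data_version argument, and the one-row-per-line file layout are all made up for illustration. It relies only on the behaviour described above, that changing gen_kwargs leads to a new cache.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path):
    """Return a SHA-256 hex digest of the file's bytes (hypothetical helper)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def generator(path, data_version):
    # `data_version` is ignored here; it exists only so that gen_kwargs
    # change whenever the file on disk changes.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield {"text": line.rstrip("\n")}


def build_dataset(path):
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import Dataset

    return Dataset.from_generator(
        generator,
        gen_kwargs={"path": str(path), "data_version": file_fingerprint(path)},
    )
```

With this, two runs over an unchanged file pass identical gen_kwargs, while editing the file produces a different data_version. For large files, hashing the full contents may be slow; the modification time could serve as a cheaper stand-in.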