IterableDataset.from_generator with iterator

Hi Friends :wave:
I want to create an IterableDataset from an iterator, like this:

from datasets import Dataset

def gen():
    # `iterator` is an existing generator object defined elsewhere
    for sample in iterator:
        yield sample

ds = Dataset.from_generator(gen)

but it throws an error:
TypeError: cannot pickle 'generator' object

Is there a way to create a dataset like this?

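Currently, from_generator requires a picklable generator function, because the function is hashed with pickle so that the dataset can be cached on disk. In your example, gen refers to iterator, which is a generator object, and generator objects cannot be pickled.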
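To see where the error comes from, here is a minimal sketch that uses only the standard library's pickle module. The `iterator` below is a stand-in for the one in the question, and the sketch does not reproduce how datasets hashes the function internally; it only shows that the generator object itself is what cannot be pickled.

```python
import pickle

# Stand-in for the iterator from the question: a plain generator object.
iterator = ({"value": i} for i in range(3))

try:
    pickle.dumps(iterator)
    pickle_error = None
except TypeError as err:
    # Generator objects hold a live execution frame, so pickle refuses them.
    pickle_error = err

print(type(pickle_error).__name__, pickle_error)
```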

A workaround is to use a picklable callable class, i.e. one that implements __reduce__ so that it can be pickled and __call__ to yield the samples.
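Here is a rough sketch of what such a class could look like. The names `PicklableGenerator` and `generator_id` are made up for illustration and are not part of the datasets API. Pickling the wrapper as nothing more than an ID string is an assumption on my side: it gives the hashing step something stable to work with, but it is not an official recipe.

```python
import pickle
import uuid


class PicklableGenerator:
    """Hypothetical wrapper: callable like a generator function, pickles as an ID."""

    def __init__(self, iterator, generator_id=None):
        self.iterator = iterator
        # The ID stands in for the (unpicklable) iterator when hashing.
        self.generator_id = generator_id or uuid.uuid4().hex

    def __call__(self):
        # Yield the samples from the wrapped iterator (it can be consumed only once).
        yield from self.iterator

    def __reduce__(self):
        # Pickle only the ID string; the iterator itself is never serialized.
        # Unpickling therefore returns the ID, not a working wrapper.
        return (str, (self.generator_id,))


iterator = ({"value": i} for i in range(3))
gen = PicklableGenerator(iterator, generator_id="my-dataset-v1")

payload = pickle.dumps(gen)  # succeeds, unlike pickling the iterator directly
samples = list(gen())
print(samples)
```

You would then pass the instance instead of the function, e.g. Dataset.from_generator(gen). Two caveats with this sketch: reusing the same generator_id presumably means the same cache entry gets reused even if the data changed, so pick a new ID when it does; and because the pickle does not contain the data, I would not expect it to work when the generator has to be sent to other processes.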

In the future we might allow non-picklable generators (e.g. generators based on iterators) and emit a warning saying that the cache is not used in that case.


Thank you :slight_smile: I ended up using the hack from huggingface/datasets GitHub issue #6194, "Support custom fingerprinting with `Dataset.from_generator`".