Create a dataset from generator

Is there any way to create a dataset from a generator (without it being loaded into memory)? Something similar to

The datasets are not completely read into memory so you should not have to worry about memory usage. It’s mostly fast on-disk access thanks to memory mapping.

We don’t have a .from_generator method yet, but this is something we may add!

If you want to generate a dataset from text/json/csv files, then you can do it directly using load_dataset. More information is available in the documentation.

Currently, to make a dataset from a custom generator, you can write a dataset script that yields the examples. When you call load_dataset("path/to/my/dataset/script"), it iterates through the generator and writes all the examples to an Arrow file without loading them into memory. A Dataset object is then created whose data is memory-mapped from disk. Memory mapping lets you use the dataset without loading it entirely into memory.

You can find how to write a dataset script here

@shpotes were you able to solve this problem? If so, would you please share the solution code?

You can find how to write a dataset script here

new link is

For those interested, we added .from_generator recently 🙂