Create a dataset from a generator

Is there any way to create a dataset from a generator (without it being loaded into memory)? Something similar to tf.data.Dataset.from_generator.

The datasets are not completely read into memory so you should not have to worry about memory usage. It’s mostly fast on-disk access thanks to memory mapping.

We don’t have a .from_generator method yet, but this is something we may add!

If you want to generate a dataset from text/json/csv files, you can do it directly with load_dataset. More information is available in the documentation.
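As a rough illustration of the file-based route, here is a minimal sketch that writes a small CSV file and wraps the load_dataset call in a helper. The file name, the column names, and both helper functions are made up for this example; only load_dataset("csv", data_files=...) comes from the library.

```python
import csv
import os
import tempfile


def write_example_csv(path):
    """Write a tiny two-column CSV file (hypothetical sample data)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        writer.writerow(["hello world", 0])
        writer.writerow(["goodbye world", 1])


def load_csv_as_dataset(path):
    """Load a CSV file through the generic "csv" loader of `datasets`."""
    # Imported lazily so the CSV helper above works without the library installed.
    from datasets import load_dataset

    # With a single file and no explicit split names, the rows land in "train".
    return load_dataset("csv", data_files=path)["train"]


# Create the sample file in a temporary directory.
csv_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
write_example_csv(csv_path)
```

With the datasets library installed, calling load_csv_as_dataset(csv_path) should give you a Dataset backed by an Arrow file on disk rather than an in-memory copy of the CSV.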

Currently, to make a dataset from a custom generator, you can write a dataset script that yields the examples. When you call load_dataset("path/to/my/dataset/script"), it iterates through the generator and writes all the examples to an Arrow file without loading them into memory. A Dataset object is then created whose data is memory-mapped from your disk. Memory mapping lets you access the dataset without loading all of it into memory.
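To make the idea concrete, here is a toy sketch of the same pattern using only the Python standard library: stream a generator to a file one example at a time, then read individual examples back through a memory map. This is not how the library is implemented. It uses JSON lines instead of Arrow, and every name in it (example_generator, write_examples, MappedExamples) is hypothetical.

```python
import json
import mmap
import os
import tempfile


def example_generator(n):
    """Hypothetical generator yielding one example at a time."""
    for i in range(n):
        yield {"id": i, "text": f"example {i}"}


def write_examples(generator, path):
    """Stream examples to disk, holding only one example in memory at a time.

    Returns the byte offset at which each record starts.
    """
    offsets = []
    with open(path, "wb") as f:
        for example in generator:
            offsets.append(f.tell())
            # json.dumps escapes newlines, so each record stays on one line.
            f.write(json.dumps(example).encode("utf-8") + b"\n")
    return offsets


class MappedExamples:
    """Random access to the records through a read-only memory map."""

    def __init__(self, path, offsets):
        self._file = open(path, "rb")
        self._mmap = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._offsets = offsets

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, index):
        start = self._offsets[index]
        end = self._mmap.find(b"\n", start)
        # Only the bytes of this one record are copied out of the mapping.
        return json.loads(self._mmap[start:end])

    def close(self):
        self._mmap.close()
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "examples.jsonl")
offsets = write_examples(example_generator(1000), path)
examples = MappedExamples(path, offsets)
first = examples[0]
last = examples[999]
examples.close()
```

Note that this sketch keeps the list of offsets in memory, which is small compared to the data itself. The operating system pages the file contents in and out as needed, which is what keeps memory usage low when the file is much larger than RAM.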

You can find how to write a dataset script here