Create a dataset from a generator

We don’t have a .from_generator method yet, but this is something we may add!

If you want to create a dataset from text, JSON, or CSV files, you can do it directly with load_dataset. See the documentation for more information.

Currently, to make a dataset from a custom generator, you can write a dataset script that yields the examples. When you call load_dataset("path/to/my/dataset/script"), it iterates through the generator and writes all the examples to an Arrow file without loading them all into memory. A Dataset object is then created whose data is memory-mapped from your disk. Memory-mapping lets you access the dataset without loading all of it into RAM.
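To make the idea concrete, here is a small standard-library sketch of the same pattern: stream examples from a generator to a file, then memory-map that file and read one record. It is only an analogy. The library stores data in Arrow files, not in the fixed-width binary layout assumed here, and number_generator is a hypothetical stand-in for a custom data source.

```python
import mmap
import os
import struct
import tempfile


def number_generator(n):
    """Hypothetical generator standing in for a custom data source."""
    for i in range(n):
        yield i * i


RECORD = struct.Struct("<q")  # one little-endian 64-bit integer per example

# Stream examples to disk one at a time; nothing accumulates in a list.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for value in number_generator(1000):
        f.write(RECORD.pack(value))
    path = f.name

# Memory-map the file: the OS pages data in lazily as it is accessed.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Random access to the record at index 10 without reading the whole file.
    (tenth,) = RECORD.unpack_from(mapped, 10 * RECORD.size)
    mapped.close()

os.remove(path)
print(tenth)  # 100
```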

The documentation explains how to write a dataset script.
