Create a dataset from a generator

Is there any way to create a dataset from a generator (without loading it into memory)? Something similar to tf.data.Dataset.from_generator.

Datasets are not fully read into memory, so you shouldn't have to worry about memory usage. Access is mostly fast on-disk reads thanks to memory mapping.

We don’t have a .from_generator method yet, but this is something we may add!

If you want to create a dataset from text/JSON/CSV files, you can do it directly with load_dataset. There is more information in the documentation.
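For example, a quick sketch (the file names are placeholders, not from this thread):

```python
from datasets import load_dataset

# Each call reads the files and memory-maps the resulting dataset from disk.
csv_ds = load_dataset("csv", data_files="my_data.csv")
json_ds = load_dataset("json", data_files="my_data.json")
text_ds = load_dataset("text", data_files="my_data.txt")
```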

Currently, to make a dataset from a custom generator you can write a dataset script that yields the examples. When you call load_dataset("path/to/my/dataset/script"), it iterates through the generator and writes all the examples to an Arrow file without loading them into memory. A Dataset object is then created containing your data, memory-mapped from disk. Memory mapping lets you work with the dataset without reading it into memory.

You can find how to write a dataset script here
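For reference, a minimal sketch of such a dataset script; the class name, feature names, and example values are illustrative only, not from this thread:

```python
# my_dataset.py -- a minimal generator-based dataset script
import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")})
        )

    def _split_generators(self, dl_manager):
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN)]

    def _generate_examples(self):
        # Examples yielded here are written to an Arrow file on disk,
        # not accumulated in memory.
        for idx in range(1000):
            yield idx, {"text": f"example {idx}"}
```

It would then be loaded with load_dataset("path/to/my_dataset.py").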


@shpotes were you able to solve this problem? If so, would you please share the solution code?

You can find how to write a dataset script here

The new link is https://huggingface.co/docs/datasets/dataset_script

EDIT: you should use .from_generator() now, which doesn’t require implementing a custom class.

For those interested, we added .from_generator recently :slight_smile:

docs:
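A minimal usage sketch (the generator and field names are just illustrative):

```python
from datasets import Dataset


def gen():
    # Examples are written to an Arrow file on disk as they are yielded,
    # so the full dataset never has to fit in memory.
    for i in range(1000):
        yield {"id": i, "text": f"example {i}"}


ds = Dataset.from_generator(gen)
print(ds[0])
```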


@lhoestq I was wondering where I can find detailed information about how caching works (in particular, what informs the decision to load from the cache versus doing the processing again)?

I am generating datasets on the fly for inference during evaluate and predict, and I passed the cache_dir option so that the datasets are generated only under the main process (I’m using Trainer's accelerator.main_process_first() context manager during evaluate calls). However, when evaluating at different points I need to generate the dataset again. Am I correct in thinking that I should wipe the cache/generator between eval calls to ensure the evaluation does not use previously generated datasets?

You can find the documentation here :slight_smile:
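In case it helps, here is a hedged sketch of two cache-control options; whether they are actually needed for from_generator datasets depends on how the generator is fingerprinted in your setup, so treat this as something to verify rather than a definitive answer:

```python
import shutil

import datasets

# Option 1: turn off caching globally so processing is redone each time
# (assumption: this covers your from_generator calls as well).
datasets.disable_caching()

# Option 2 (assumption: you passed an explicit cache_dir): delete that
# directory between evaluation rounds so stale Arrow files cannot be reused.
# shutil.rmtree("path/to/eval_cache", ignore_errors=True)
```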