Is there any way to create a dataset from a generator (without loading it all into memory)? Something similar to tf.data.Dataset.from_generator.
Datasets are not read completely into memory, so you shouldn't have to worry about memory usage. Access is mostly fast on-disk reads thanks to memory mapping.
We don’t have a .from_generator method yet, but this is something we may add!
If you want to generate a dataset from text/json/csv files, you can do it directly using load_dataset. More information in the documentation.
Currently, to make a dataset from a custom generator, you can write a dataset script that yields the examples. When you call load_dataset("path/to/my/dataset/script"), it iterates through the generator and writes all the examples to an Arrow file without loading them into memory. A Dataset object is then created containing your data, memory-mapped from disk. Memory mapping lets you load the dataset without reading it into memory.
You can find how to write a dataset script here
@shpotes were you able to solve this problem? If so, would you please share the solution code?
You can find how to write a dataset script here
The new link is https://huggingface.co/docs/datasets/dataset_script
EDIT: you should use .from_generator() now, which doesn’t require implementing a custom class.
For those interested, we recently added .from_generator.
Docs:
- for Dataset: Main classes
- for IterableDataset: Main classes
@lhoestq I was wondering where I can find detailed information about how caching works (in particular, what informs the decision to load from cache versus doing the processing again)?
I am generating datasets for inference on-the-fly during evaluate and predict, and I passed the cache_dir option so that the datasets are generated only under the main process (so I’m using Trainer's accelerator.main_process_first() context manager during evaluate calls). However, when evaluating at different points I need to generate the dataset again. Am I correct in thinking that I should wipe the cache/generator between eval calls to ensure the evaluation does not use previously generated datasets?
You can find the documentation here