Is there any way to create a dataset from a generator (without loading it all into memory)? Something similar to tf.data.Dataset.from_generator.
Datasets are not read completely into memory, so you shouldn't have to worry about memory usage. Access is mostly fast on-disk reads thanks to memory mapping.
We don’t have a .from_generator method yet, but this is something we may add!
If you want to generate a dataset from text/json/csv files, you can do it directly using load_dataset. More information in the documentation.
Currently, to make a dataset from a custom generator, you can write a dataset script that yields the examples. When you call load_dataset("path/to/my/dataset/script"), it iterates through the generator and writes all the examples to an Arrow file without loading them into memory. A Dataset object is then created containing your data, memory-mapped from disk. Memory mapping lets you load the dataset without reading it into memory.
You can find how to write a dataset script here
@shpotes were you able to solve this problem? If so, would you please share the solution code?
You can find how to write a dataset script here
The new link is https://huggingface.co/docs/datasets/dataset_script
EDIT: you should use .from_generator() now, which doesn’t require implementing a custom class.
For those interested, we recently added .from_generator.
Docs:
- for Dataset: Main classes
- for IterableDataset: Main classes
@lhoestq I was wondering where I can find detailed information about how caching works (in particular, what informs the decision to load from cache versus doing the processing again)?
I am generating datasets for inference on-the-fly during evaluate and predict, and I passed the cache_dir option so that the datasets are generated only under the main process (so I’m using Trainer's accelerator.main_process_first() context manager during evaluate calls). However, when evaluating at different points I need to generate the dataset again. Am I correct in thinking that I should wipe the cache/generator between eval calls to ensure the evaluation does not use previously generated datasets?
You can find the documentation here