I forgot to mention that I was trying to avoid mapping the entire dataset at once because of its large size (170+ GB). That’s what I initially did, but the map takes 2-3 hours at the start of EVERY training run (and inflates the dataset to 700+ GB), which is infeasible for me.
You can use multiprocessing (the `num_proc` parameter) and disable image decoding with `ds = ds.cast_column("image", Image(decode=False))` (turn it back on later with `ds = ds.cast_column("image", Image())`) to make the processing faster. Also, it’s best to run this transform on the “raw” dataset so that later runs can reload the result from the cache.
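For reference, here is a minimal sketch of that pattern; the dataset name, the transform body, and the `num_proc` value are placeholders you’d swap for your own:

```python
from datasets import load_dataset, Image

# Placeholder dataset; use your own ~170GB image dataset here.
ds = load_dataset("your/dataset", split="train")

# Disable decoding so map() passes around raw bytes instead of decoded PIL images.
ds = ds.cast_column("image", Image(decode=False))

def transform(batch):
    # ... your actual preprocessing on the raw bytes / metadata ...
    return batch

# batched + num_proc speeds things up; the result is cached on disk,
# so later runs reload it instead of recomputing.
ds = ds.map(transform, batched=True, num_proc=8)

# Turn decoding back on once the heavy processing is done.
ds = ds.cast_column("image", Image())
```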
The only on-the-fly, one-to-many-examples option is the one mentioned above, right?
Correct!