Using a generator for the map function in an iterable dataset


I have an audio dataset, and I’m using the standard load_dataset pattern to get the dataset into memory.

dataset = load_dataset("audiofolder", data_dir="../data", streaming=streaming)

And then I use map with batched=True to do preprocessing. The problem is that the preprocessing explodes the size of the dataset because each batch is materialized in full. Is it possible to pass a generator to map instead of a function? That would fix the problem, since a single sample would be emitted whenever next() is called instead of the whole batch being built in one go.

I know that it’s possible to create an IterableDataset from a generator, and I can do that as a last resort, but the load_dataset is quite nice and saves me from doing a bunch of manual librosa stuff.
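For reference, the generator route is just a plain Python function you could hand to IterableDataset.from_generator from the `datasets` library; the names and fields below are hypothetical, but they show why nothing beyond the current sample is ever materialized:

```python
# A plain-Python generator of the kind IterableDataset.from_generator accepts
# (hedged sketch; the field names and preprocessing are hypothetical).
def sample_generator():
    for i in range(3):
        # per-sample preprocessing would happen here, one example
        # at a time, so no batch is ever built up in memory
        yield {"id": i, "audio": [0.0, 0.1, 0.2]}

gen = sample_generator()
first = next(gen)  # only this one sample has been computed so far
```

With the `datasets` library available, `IterableDataset.from_generator(sample_generator)` would wrap this into a streaming dataset.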

Thanks in advance for any tips!

You can reduce the batch_size so that map only materializes a small batch at a time. Or simply don't use batching at all (batched=False, the default), in which case your function receives one example per call.
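As a minimal sketch of both options, assuming the usual `Dataset.map` API; the `preprocess` function and the `audio` column are hypothetical stand-ins for your real preprocessing:

```python
# Hypothetical batched preprocess: receives a dict of columns, where
# each value is a list covering the whole batch.
def preprocess(batch):
    batch["length"] = [len(a) for a in batch["audio"]]
    return batch

# With a small batch_size, map only materializes a few samples at once:
# dataset = dataset.map(preprocess, batched=True, batch_size=8)

# Without batching (the default), the function sees one example at a time,
# so each column value is a single item rather than a list:
# dataset = dataset.map(lambda ex: {"length": len(ex["audio"])})
```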