I have a large dataset of over 10,000 audio samples. Whenever I try to map it, it exceeds available memory. I have used batching, but it still fails. Is there a better way to do it?
Hi! You can lower memory usage by reducing `batch_size` in `map` (the default is 1000).
Thanks Mario. I have used that, but it didn't work for me. Any other ideas?
I think the parameter you're looking for is `writer_batch_size`: it's the number of rows per write operation for the cache file writer. The default is currently 1,000. A higher value makes the processing do fewer data flushes; a lower value consumes less temporary memory while running `.map()`.
Feel free to set both `batch_size` and `writer_batch_size` to lower values.
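To see why lowering `writer_batch_size` helps, here is a minimal sketch (not the actual `datasets` internals; the function and variable names are made up for illustration) of the trade-off: rows are buffered in memory and flushed to the cache file every `writer_batch_size` rows, so a smaller value means more flushes but a smaller peak buffer.

```python
def process_rows(n_rows, writer_batch_size):
    """Simulate buffered writing: returns (number of flushes, peak buffer size)."""
    buffer = []
    flushes = 0
    peak = 0
    for i in range(n_rows):
        buffer.append(i)  # stand-in for one processed audio row
        peak = max(peak, len(buffer))
        if len(buffer) >= writer_batch_size:
            flushes += 1  # stand-in for writing the buffer to the cache file
            buffer.clear()
    if buffer:  # flush any remaining rows
        flushes += 1
    return flushes, peak

# Lower writer_batch_size -> more flushes, but a smaller peak buffer:
print(process_rows(10_000, 1_000))  # (10, 1000)
print(process_rows(10_000, 100))    # (100, 100)
```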
cc @mariosasko don't you think we should have `writer_batch_size = batch_size` by default?
@lhoestq Yes, I agree. We can set the default value of `writer_batch_size` to `None` in the signature of `map`, and do `writer_batch_size = batch_size if writer_batch_size is None else writer_batch_size` in the body.
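The proposed defaulting pattern would look something like this (a sketch of the idea only, with a simplified signature; the real `map` takes many more arguments):

```python
def map(batch_size=1000, writer_batch_size=None):
    # If the caller didn't set writer_batch_size explicitly,
    # fall back to the same value as batch_size.
    writer_batch_size = batch_size if writer_batch_size is None else writer_batch_size
    return batch_size, writer_batch_size

print(map(batch_size=200))                        # (200, 200)
print(map(batch_size=200, writer_batch_size=50))  # (200, 50)
```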