Ideal batch_size and writer_batch_size for datasets

sriniv · December 9, 2022, 4:45am

Hi,

I have CSV files with about 1 million rows of textual data. I am preprocessing this data and experimenting with both datasets.map and pandas with multiprocessing.

For pandas, I use the number of cores as the batch count (so the batch size is 1 million / num_cores) and process the batches in parallel. I do the same with Hugging Face datasets, using batch_size = 1 million / num_cores. However, datasets.map is slower than pandas multiprocessing.

@lhoestq - Is there an ideal batch_size and/or writer_batch_size that I can use to make datasets.map faster than pandas multiprocessing?

lhoestq · December 9, 2022, 10:19am

datasets processes data on disk by default, which is a bit slower to read and write than pandas, which holds the data in RAM. You can pass keep_in_memory=True when loading the dataset to hold the data in RAM as well, instead of reading it from disk.

Beyond that, you can tweak batch_size and writer_batch_size: batch_size controls how many examples are passed to your processing function at a time, and writer_batch_size controls how many rows are written per write operation. The best values really depend on your resources and the kind of processing you’re doing.

Some functions are faster when batched, but you don’t always want to pass large batches, since you may run out of memory (especially in computer vision or audio processing).
