Hi,
I have CSV files with about 1 million rows of textual data. I am preprocessing this data and experimenting with both `datasets.map` and pandas with multiprocessing.
For pandas, I split the data into as many batches as there are cores (batch size = 1 million / num_cores) and process them in parallel. I do the same with Hugging Face datasets, using batch_size = 1 million / num_cores. However, `datasets.map` is slower than pandas with multiprocessing.
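For reference, here is a minimal sketch of what both setups look like on my end (the `text` column name, the lowercase/strip step, and `data.csv` are placeholders, not my actual pipeline):

```python
import multiprocessing as mp

import pandas as pd
from datasets import load_dataset

NUM_CORES = mp.cpu_count()

def preprocess_df(chunk: pd.DataFrame) -> pd.DataFrame:
    # placeholder preprocessing on one pandas chunk
    chunk["text"] = chunk["text"].str.lower().str.strip()
    return chunk

def preprocess_batch(batch: dict) -> dict:
    # same placeholder preprocessing, written for a batched datasets.map call
    batch["text"] = [t.lower().strip() for t in batch["text"]]
    return batch

if __name__ == "__main__":
    # --- pandas + multiprocessing: num_cores chunks of ~1M/num_cores rows ---
    df = pd.read_csv("data.csv")
    batch_size = -(-len(df) // NUM_CORES)  # ceiling division
    chunks = [df.iloc[i:i + batch_size] for i in range(0, len(df), batch_size)]
    with mp.Pool(NUM_CORES) as pool:
        df = pd.concat(pool.map(preprocess_df, chunks))

    # --- Hugging Face datasets: same batch size, one process per core ---
    ds = load_dataset("csv", data_files="data.csv", split="train")
    ds = ds.map(
        preprocess_batch,
        batched=True,
        batch_size=len(ds) // NUM_CORES,
        num_proc=NUM_CORES,
        writer_batch_size=1000,  # the default; unsure whether tuning this helps
    )
```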
@lhoestq - Is there an ideal batch_size and/or writer_batch_size that I can use to make datasets.map faster than pandas multiprocessing?