Hi,
I have CSV files with about 1 million rows of textual data. I am preprocessing this data and experimenting with both `datasets.map` and pandas with multiprocessing.
For pandas, I split the data into as many batches as there are cores (batch size = 1 million / num_cores) and process them in parallel. I do the same with Hugging Face datasets, using batch_size = 1 million / num_cores. However, `datasets.map` is slower than pandas with multiprocessing.
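For reference, here is a minimal sketch of what both setups look like on my end (the `text` column name, the lowercase/strip step, and `data.csv` are placeholders, not my actual pipeline):

```python
import multiprocessing as mp

import pandas as pd
from datasets import load_dataset

NUM_CORES = mp.cpu_count()

def preprocess_df(chunk: pd.DataFrame) -> pd.DataFrame:
    # placeholder preprocessing on one pandas chunk
    chunk["text"] = chunk["text"].str.lower().str.strip()
    return chunk

def preprocess_batch(batch: dict) -> dict:
    # same placeholder preprocessing, written for a batched datasets.map call
    batch["text"] = [t.lower().strip() for t in batch["text"]]
    return batch

if __name__ == "__main__":
    # --- pandas + multiprocessing: num_cores chunks of ~1M/num_cores rows ---
    df = pd.read_csv("data.csv")
    batch_size = -(-len(df) // NUM_CORES)  # ceiling division
    chunks = [df.iloc[i:i + batch_size] for i in range(0, len(df), batch_size)]
    with mp.Pool(NUM_CORES) as pool:
        df = pd.concat(pool.map(preprocess_df, chunks))

    # --- Hugging Face datasets: same batch size, one process per core ---
    ds = load_dataset("csv", data_files="data.csv", split="train")
    ds = ds.map(
        preprocess_batch,
        batched=True,
        batch_size=len(ds) // NUM_CORES,
        num_proc=NUM_CORES,
        writer_batch_size=1000,  # the default; unsure whether tuning this helps
    )
```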
@lhoestq - Is there an ideal batch_size and/or writer_batch_size that I can use to make datasets.map faster than pandas multiprocessing?