I have CSV files with about 1 million rows of textual data. I am preprocessing this data and experimenting with both datasets.map and pandas with multiprocessing.
For pandas, I split the data into one batch per core (batch size = 1 million / num_cores) and process the batches in parallel. I do the same with Hugging Face datasets, using batch_size = 1 million / num_cores. However, datasets.map is slower than pandas multiprocessing.
@lhoestq - Is there an ideal batch_size and/or writer_batch_size that I can use to make datasets.map faster than pandas multiprocessing?
datasets processes data on disk by default, which is a bit slower to read and write than pandas, which holds the data in RAM. You can, however, pass keep_in_memory=True when loading the dataset to hold the data in RAM instead of reading it from disk.
Beyond that, you can tweak batch_size and writer_batch_size: batch_size controls how many examples are passed to your processing function at a time, and writer_batch_size controls how many rows are written per write operation. The best values really depend on your resources and the kind of processing you’re doing.
Some functions are faster when batched, but at the same time you don’t always want to give big batches or you may OOM (especially in computer vision or audio processing).
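To make this concrete, here is a rough sketch of how these options fit together. It is only an illustration under assumptions: the "text" column, the strip/lowercase step, and the helper names (clean_batch, rows_per_worker, preprocess_csv) are made up for the example, and the values of 1000 are starting points to tune, not recommendations. Note that when num_proc is set, map splits the dataset across processes on its own, so batch_size does not have to be 1 million / num_cores.

```python
import math


def clean_batch(batch):
    """Batched map function: receives a dict mapping column name -> list of values.

    The "text" column and the strip/lowercase step are placeholders for
    whatever preprocessing you actually do.
    """
    return {"text": [t.strip().lower() for t in batch["text"]]}


def rows_per_worker(n_rows, num_proc):
    """Approximate rows each process handles when the data is split num_proc ways."""
    return math.ceil(n_rows / num_proc)


def preprocess_csv(csv_path, num_proc=4, batch_size=1000, writer_batch_size=1000):
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    # keep_in_memory=True holds the data in RAM instead of reading it from disk.
    ds = load_dataset("csv", data_files=csv_path, split="train", keep_in_memory=True)

    return ds.map(
        clean_batch,
        batched=True,
        batch_size=batch_size,                # examples passed to clean_batch per call
        writer_batch_size=writer_batch_size,  # rows per write operation
        num_proc=num_proc,                    # map shards the dataset across processes
        keep_in_memory=True,                  # keep the result in RAM as well
    )
```

You would call it as preprocess_csv("data.csv") for a hypothetical data.csv with a "text" column, then time a few batch_size values (for example 1000, 10000, 100000) on a sample of the data. Much smaller batches than 1 million / num_cores are often enough to get the benefit of batching while keeping memory use bounded.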