Datasets map is slower than pandas apply

vishalrao · April 23, 2022, 1:22am

In some of my tests, the map method is slower than pandas’ apply method. I understand that map can be sped up with multiprocessing, but is there anything else that can be done to improve performance?

E.g. using pandas:

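import pandas as pd

fields = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

# dummy_df (not shown here) is a DataFrame whose col_c column holds dicts keyed by some of these fields
dummy_df.col_c.apply(lambda e: [e.get(c) or 0 for c in fields])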

This takes 1.5 seconds to run on my machine.

E.g. using map:

from datasets import Dataset

dummy_ds = Dataset.from_pandas(dummy_df)
dummy_ds.map(lambda r: {'col_d': [r['col_c'].get(c) or 0 for c in fields]})

This takes 48.7 seconds to run.
