The Dataset.map method is slower than pandas' apply method in some of my tests. I understand that map can be sped up with multiprocessing, but is there anything else I can do to improve performance?
E.g. using pandas:
import pandas as pd
fields = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
dummy_df.col_c.apply(lambda e: [e.get(c) or 0 for c in fields])
This takes 1.5 seconds to run on my machine.
E.g. using map:
from datasets import Dataset
dummy_ds = Dataset.from_pandas(dummy_df)
dummy_ds.map(lambda r: {'col_d': [r['col_c'].get(c) or 0 for c in fields]})
This takes 48.7 seconds to run.