Explain why datasets.map is faster compared to other similar libraries

sriniv · August 2, 2022, 7:41am

I compared Datasets.map() with multiprocessing enabled (num_proc set to the number of cores) against Dask (also with multiprocessing), and observed that Datasets is faster. What explains why Datasets is faster than Dask or similar libraries?

sriniv · August 17, 2022, 8:40am

@lhoestq - Is it because datasets uses arrow internally?

lhoestq · August 17, 2022, 10:37am

Arrow definitely plays a significant role:

- it’s an in-memory format (no deserialization overhead)
- datasets uses memory-mapped Arrow tables to get the best I/O performance while keeping RAM usage very low

Dask’s performance, however, may depend on the data format you’re using, which may require extra CPU for compression or extra I/O.

sriniv · September 6, 2022, 8:27am

@lhoestq what’s the extra benefit we get when using datasets instead of arrow directly to process data?

lhoestq · September 6, 2022, 12:06pm

Arrow is lower level, and is made to be used as a back-end for data processing frameworks.

With datasets you can easily load, transform, share and train models using your data.
