Explain why datasets.map is faster compared to other similar libraries

I was trying to compare Datasets.map() with multiprocessing (num_cores) enabled and DASK (with multiprocessing). I observed datasets is faster. How can we explain why Datasets is faster when compared to something like DASK or similar libraries?

@lhoestq - Is it because datasets uses arrow internally?

Arrow definitely plays a significant role:

  • it’s an in-memory format (no deserialization overhead)
  • datasets use memory mapped Arrow tables to get the best I/O performance, while maintaining very low RAM usage

Dask’s performance however may depend on the data format you’re using, which may require extra CPU for compression or extra I/O

@lhoestq what’s the extra benefit we get when using datasets instead of arrow directly to process data?

Arrow is lower level, and is made to be used as a back-end for data processing frameworks.

With datasets you can easily load, transform, share and train models using your data.