I compared Dataset.map() with multiprocessing enabled (num_proc) against Dask (also with multiprocessing), and I observed that datasets is faster. What explains why datasets is faster than Dask or similar libraries?
@lhoestq - Is it because datasets uses Arrow internally?
Arrow definitely plays a significant role:
- it’s an in-memory format (no deserialization overhead)
- datasets uses memory-mapped Arrow tables to get the best I/O performance while keeping RAM usage very low
Dask’s performance, however, may depend on the data format you’re using, which may require extra CPU for compression or decompression, or extra I/O.
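To make the two points about Arrow a bit more concrete, here is a rough standard-library analogy. It is not Arrow or the datasets implementation: the file layout, the `demo` function and the file names are made up for illustration. It contrasts a serialized file (pickle), which has to be fully read and decoded into Python objects, with a fixed-width binary column that is memory mapped and read in place.

```python
import mmap
import os
import pickle
import tempfile
from array import array


def demo(n=100_000, index=12_345):
    """Read one value from a pickled column and from a memory-mapped column."""
    values = array("q", range(n))  # n signed 64-bit integers

    with tempfile.TemporaryDirectory() as tmp:
        raw_path = os.path.join(tmp, "column.bin")  # hypothetical file names
        pkl_path = os.path.join(tmp, "column.pkl")

        # Write the same column twice: raw fixed-width bytes and a pickle.
        with open(raw_path, "wb") as f:
            values.tofile(f)
        with open(pkl_path, "wb") as f:
            pickle.dump(list(values), f)

        # Serialized format: the whole file is read and decoded into
        # Python objects before a single value can be used.
        with open(pkl_path, "rb") as f:
            decoded = pickle.load(f)
        decoded_value = decoded[index]

        # Memory-mapped format: the OS pages in only what is touched and
        # the bytes are interpreted in place, with no decoding step.
        with open(raw_path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                with memoryview(mm) as raw, raw.cast("q") as column:
                    mapped_value = column[index]
                    mapped_len = len(column)

    return decoded_value, mapped_value, mapped_len
```

Both reads return the same value, but the memory-mapped one never builds the full list in RAM. Arrow's columnar layout makes the same kind of in-place access possible for much richer data types.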
@lhoestq What extra benefit do we get from using datasets instead of Arrow directly to process data?
Arrow is lower-level and is designed to be used as a backend for data processing frameworks.
With datasets you can easily load, transform, and share your data, and train models on it.