Multiprocessing map taking too much memory footprint

lhoestq · April 3, 2024, 12:37pm

A dataset created in memory (e.g. with .from_dict()) doesn’t have a cache file yet, so if you want map() to write to disk instead of filling up your memory, you should pass a cache_file_name to map().

Note that at some point we might automatically allocate a cache for such in-memory datasets, to align with the general behavior.
