datasets.Dataset.map() idle processes when multiprocessing

I’m running datasets.Dataset.map() with num_proc=64, and after a while the CPU utilization falls far below 100% (3.16%).
This suggests workers are assigned a fixed list of jobs at the start and go idle once they finish that list, instead of pulling one of the remaining jobs on demand.

Is my intuition correct? And if so, what’s the thinking behind this design choice? Wouldn’t it be more efficient to keep a shared pool of jobs available to all workers, so they all stay busy until no jobs are left?

Thank you

Hi! When num_proc > 1, map splits the dataset into num_proc shards, each of which is mapped to one of the num_proc workers. So in your case, this means that some workers finished processing their shards earlier than others.
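To see why this static split can leave workers idle, here’s a toy cost model (illustrative only, not datasets’ code): with one fixed contiguous shard per worker, the wall time is set by the most expensive shard, not by total cost divided by num_proc.

```python
def shard_costs(row_costs, num_proc):
    """Total per-row cost of each of num_proc contiguous shards."""
    base, extra = divmod(len(row_costs), num_proc)
    totals, start = [], 0
    for i in range(num_proc):
        end = start + base + (1 if i < extra else 0)
        totals.append(sum(row_costs[start:end]))
        start = end
    return totals

# 8 cheap rows followed by 8 expensive rows, split across 4 workers:
costs = [1] * 8 + [10] * 8
per_worker = shard_costs(costs, 4)   # [4, 4, 40, 40]
# ideal wall time ~ sum(costs) / 4 = 22; static-shard wall time ~ max = 40,
# during which workers 0 and 1 sit idle after finishing their cheap shards
```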

Thanks, just like I thought.

So my question then is: what’s the thought behind this design choice? Surely implementing a multiprocessing queue of rows/batches to take on would prevent processes from going idle due to the workload not being evenly distributed across shards?

Also, is there any guarantee on the way the dataset is currently sharded? Is the dataset simply split into consecutive contiguous shards of size ~len(data)/num_proc, where worker 0 gets rows 0 to len(data)/num_proc, worker 1 gets rows len(data)/num_proc to 2*len(data)/num_proc, and so on?

EDIT: looks like it’s contiguous, from source:
self.shard(num_shards=num_proc, index=rank, contiguous=True, keep_in_memory=keep_in_memory)
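For reference, the index ranges produced by contiguous sharding can be sketched like this (a reconstruction based on the documented behavior of Dataset.shard with contiguous=True, where the first len(data) % num_shards shards get one extra row; not copied from the library’s source):

```python
def contiguous_shard_ranges(n_rows, num_shards):
    """Return the [start, end) row range each shard/worker receives."""
    div, mod = divmod(n_rows, num_shards)
    ranges = []
    for index in range(num_shards):
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        ranges.append((start, end))
    return ranges

# e.g. 10 rows over 3 workers -> [(0, 4), (4, 7), (7, 10)]
```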

You can shuffle the dataset before map to distribute the workload evenly.
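In the toy cost model above, shuffling before sharding spreads the expensive rows across shards (a stdlib sketch, not datasets’ implementation; with the real library this would be ds.shuffle(seed=...) followed by map with num_proc):

```python
import random

def shard_costs(row_costs, num_proc):
    """Total per-row cost of each of num_proc contiguous shards."""
    base, extra = divmod(len(row_costs), num_proc)
    totals, start = [], 0
    for i in range(num_proc):
        end = start + base + (1 if i < extra else 0)
        totals.append(sum(row_costs[start:end]))
        start = end
    return totals

costs = [1] * 8 + [10] * 8
random.Random(42).shuffle(costs)     # shuffle rows before sharding
per_worker = shard_costs(costs, 4)
# the max shard cost now tends toward sum(costs) / 4 instead of the
# worst case of 40, so workers finish at roughly the same time
```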

So my question then is: what’s the thought behind this design choice? Surely implementing a multiprocessing queue of rows/batches to take on would prevent processes from going idle due to the workload not being evenly distributed across shards?

We can consider this, but this would definitely make our code more complex and harder to maintain.

I see, thanks! Shuffling would probably help, but there would still be idle workers toward the end if the variance of per-row workload is high. I’d be happy to contribute, since I’m already working on a custom multiprocessing solution as a workaround for this behavior.
I’d appreciate it if you could point me at what needs attention, or explain your concerns around complexity/maintainability.
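For what it’s worth, here’s a minimal sketch of the kind of dynamic scheduling I have in mind (an illustration, not datasets’ API): cut the data into many small batches and let Pool.imap_unordered hand the next batch to whichever worker is free, so nobody idles until the queue is empty. process_batch is a hypothetical stand-in for the user’s map function.

```python
from multiprocessing import Pool

def process_batch(args):
    """Stand-in for the user's map function applied to one batch of rows."""
    batch_id, rows = args
    return batch_id, [x * 2 for x in rows]

def dynamic_map(data, num_proc, batch_size):
    # Many small batches instead of num_proc big shards: free workers
    # pull the next batch as soon as they finish one.
    batches = [
        (i, data[start:start + batch_size])
        for i, start in enumerate(range(0, len(data), batch_size))
    ]
    with Pool(processes=num_proc) as pool:
        # chunksize=1 keeps the scheduling fully dynamic
        results = list(pool.imap_unordered(process_batch, batches, chunksize=1))
    results.sort()  # restore original batch order
    return [x for _, rows in results for x in rows]

if __name__ == "__main__":
    print(dynamic_map(list(range(10)), num_proc=2, batch_size=3))
```

The trade-off the maintainers mention is real: results arrive out of order and must be reassembled, and batch boundaries no longer line up with shard/cache boundaries.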

You are more than welcome to contribute. Feel free to open an issue/PR if you need some help/pointers.

Unless we drop Pool in the multiprocessing map, there will always be some idle workers by the end, since Pool has to maintain the specified number of workers (and there is no public API to change this). But replacing Pool with some other logic would most likely make our code more complex, which is probably not worth it unless it leads to significant performance savings.