Multi-gpu training - condition to stop the training computed in the main process - broadcast?


I am using accelerate for multi-gpu training.
The main process computes a boolean condition on whether we need to stop the training (to be precise, whether a certain time limit is exceeded). I only want the main process to compute that condition so that all processes are on the same page.

How can I do that?

Should I broadcast a boolean tensor with accelerator.broadcast once the condition is computed in the main process? Are there alternatives to that approach?

You could use the Accelerator.main_process_first() context manager to do the calculation on the main process, then broadcast the result so all devices get it; the other processes do nothing while they wait.

Since main_process_first uses wait_for_everyone under the hood, nothing continues until every process has reached that point.


To complete the answer a bit, here is how to broadcast:

from accelerate.utils import broadcast_object_list

end_of_training = [False]
if accelerator.is_main_process:  # is_main_process is a property, not a method
    end_of_training = [my_condition_on_main_process()]
# broadcast_object_list works in place, which is why end_of_training
# needs to be a list rather than a bare bool.
broadcast_object_list(end_of_training)
if end_of_training[0]:
    break  # or however you exit your training loop

I’m not entirely sure accelerator.wait_for_everyone() is needed, since the broadcast should wait for all processes to reach it, but it doesn’t hurt either.
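To see why the list wrapper matters: broadcast_object_list overwrites the elements of the list you pass in with the values from the source process, and a bare bool is immutable, so it could not be updated that way. A minimal pure-Python sketch (using a hypothetical fake_broadcast_object_list to simulate the in-place broadcast, no distributed setup required) illustrates the pattern:

```python
# Hypothetical stand-in for accelerate.utils.broadcast_object_list:
# it mutates the list in place with the source process's values,
# mirroring how the real call propagates objects from process 0.
def fake_broadcast_object_list(object_list, values_from_src):
    for i, value in enumerate(values_from_src):
        object_list[i] = value

# On a non-main process the condition starts out as False...
end_of_training = [False]
# ...and the (simulated) main process computed True.
fake_broadcast_object_list(end_of_training, [True])
print(end_of_training[0])  # -> True: the in-place update reached this process
```

If end_of_training were a plain bool instead of a list, reassigning it inside the broadcast helper would only rebind a local name and the caller would never see the new value.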


Works perfectly, thank you @muellerzr and @sgugger
