Multi-GPU training - condition to stop the training computed in the main process - broadcast?

Hi!

I am using accelerate for multi-gpu training.
The main process computes a boolean condition that decides whether training should stop (to be precise, whether a certain time limit has been exceeded). I want only the main process to compute that condition, so that all processes end up with the same value.

How can I do that?

Should I broadcast a boolean tensor with accelerator.broadcast once the condition is computed in the main process? Are there alternatives to that approach?
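
Roughly what I have in mind, as a sketch (accelerator is my Accelerator instance and time_limit_exceeded() is a placeholder for my actual check):

import torch
from accelerate.utils import broadcast

# every process creates a tensor of the same shape and dtype for the collective
stop = torch.tensor(0, device=accelerator.device)
if accelerator.is_main_process:
    stop += int(time_limit_exceeded())
stop = broadcast(stop, from_process=0)  # copy process 0's value to everyone
should_stop = bool(stop.item())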

You could use the Accelerator.main_process_first() context manager to do the calculation on the main process (the other processes do nothing there), then broadcast the result so that every device gets it.

Since main_process_first() uses wait_for_everyone() under the hood, nothing continues until every process has reached that point.
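
A rough sketch of that pattern (assuming accelerator is your Accelerator instance and compute_stop_condition() is a placeholder for your check):

from accelerate.utils import broadcast_object_list

end_of_training = [False]
with accelerator.main_process_first():
    # the main process runs this block first; the others enter it afterwards
    if accelerator.is_main_process:
        end_of_training = [compute_stop_condition()]
# broadcast_object_list copies the main process's value to everyone, in place
broadcast_object_list(end_of_training)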


To complete the answer a bit, here is how to broadcast:

from accelerate.utils import broadcast_object_list

# `accelerator` is your Accelerator instance; my_condition_on_main_process()
# stands in for whatever check you run (e.g. the time limit)
end_of_training = [False]
if accelerator.is_main_process:  # is_main_process is a property, not a method
    end_of_training = [my_condition_on_main_process()]
accelerator.wait_for_everyone()
# broadcast_object_list works in place, which is why end_of_training is a list
broadcast_object_list(end_of_training)
if end_of_training[0]:
    break  # assuming this runs inside your training loop

I’m not entirely sure accelerator.wait_for_everyone() is needed, since the broadcast should already wait for all processes to reach it, but it doesn’t hurt either.
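
For context, here is a rough sketch of how the snippet sits inside a training loop with a time limit (time_limit, start_time, and train_dataloader are placeholders):

import time
from accelerate.utils import broadcast_object_list

start_time = time.time()
for batch in train_dataloader:
    ...  # forward, backward, optimizer step

    end_of_training = [False]
    if accelerator.is_main_process:
        end_of_training = [time.time() - start_time > time_limit]
    broadcast_object_list(end_of_training)  # in place, from process 0
    if end_of_training[0]:
        break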


Works perfectly, thank you @muellerzr and @sgugger!
