I am using accelerate for multi-gpu training.
The main process computes a boolean condition on whether we need to stop the training (to be precise, whether a certain time limit is exceeded). I only want the main process to compute that condition so that all processes are on the same page.
How can I do that?
Should I broadcast a boolean tensor with accelerator.broadcast once the condition is computed in the main process? Are there alternatives to that approach?
You could compute the condition inside the `Accelerator.main_process_first()` context manager, then broadcast the result so every device receives it: the main process does the calculation, the other processes do nothing. Since `main_process_first` uses `wait_for_everyone` under the hood, no process continues until all of them have reached that point.
To complete the answer a bit, here is how to broadcast:
```python
from accelerate.utils import broadcast_object_list

end_of_training = [False]
if accelerator.is_main_process:  # is_main_process is a property, not a method
    end_of_training = [my_condition_on_main_process()]
accelerator.wait_for_everyone()
# broadcast_object_list modifies the list in place, which is why
# end_of_training needs to be a list
broadcast_object_list(end_of_training)
if end_of_training[0]:
    break  # exit the training loop
```
I’m not entirely sure `accelerator.wait_for_everyone()` is needed, since the broadcast itself should already synchronize the processes, but it doesn’t hurt either.