I am using accelerate for multi-GPU training.
The main process computes a boolean condition that decides whether training should stop (specifically, whether a certain time limit has been exceeded). I want only the main process to compute that condition, and then have every process act on the same result.
How can I do that?
Should I broadcast a boolean tensor with accelerator.broadcast once the condition is computed in the main process? Are there alternatives to that approach?
To round out the answer a bit, here is how to do the broadcast:

```python
from accelerate.utils import broadcast_object_list

end_of_training = [False]
if accelerator.is_main_process:
    end_of_training = [my_condition_on_main_process()]
# broadcast_object_list updates the list in place (from process 0 by
# default), which is why end_of_training needs to be a list
broadcast_object_list(end_of_training)
if end_of_training[0]:
    ...  # every process now agrees it is time to stop
```
An explicit `accelerator.wait_for_everyone()` should not be needed, since the broadcast itself blocks until all processes reach it, but adding one doesn't hurt either. As an alternative, if your accelerate version is recent enough, `accelerator.set_trigger()` and `accelerator.check_trigger()` exist for exactly this kind of cross-process early stopping.
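As an aside, the reason the flag has to be wrapped in a list is that `broadcast_object_list` mutates its argument in place; rebinding a bare name inside a function would not propagate back to the caller. A plain-Python illustration (`fake_broadcast` and `broken_broadcast` are hypothetical stand-ins, not part of accelerate):

```python
def fake_broadcast(object_list, value_from_main):
    # Mutating the list in place is visible through every
    # reference to the same list object...
    object_list[0] = value_from_main

def broken_broadcast(flag, value_from_main):
    # ...whereas rebinding a local name changes nothing
    # for the caller.
    flag = value_from_main

end_of_training = [False]
fake_broadcast(end_of_training, True)
print(end_of_training[0])  # True: the caller sees the update

flag = False
broken_broadcast(flag, True)
print(flag)  # False: the rebinding stayed local
```

This is ordinary Python mutability, but it explains why the accelerate API takes a list rather than a plain boolean.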