I'm training using the Trainer class on a multi-GPU setup.
I know that when using Accelerate (see "Comparing performance between different device setups"), in order to train with the desired learning rate we have to explicitly multiply it by the number of GPUs.
Is that also the case when using the Trainer class?
In the case of warmup steps: should the same be applied, i.e. n_warmup_steps *= n_gpus?
In the case of a learning rate scheduler: should the same be applied too?
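For concreteness, I mean something like the following sketch (just an illustration of my question; the base values and the * n_gpus scaling are placeholders, not something I know to be correct):

```python
import torch
from transformers import TrainingArguments

n_gpus = max(torch.cuda.device_count(), 1)

base_lr = 5e-5       # learning rate that worked on a single GPU
base_warmup = 500    # warmup steps that worked on a single GPU

# Is this scaling needed with the Trainer, or does it handle it internally?
args = TrainingArguments(
    output_dir="out",
    learning_rate=base_lr * n_gpus,
    warmup_steps=base_warmup * n_gpus,
    per_device_train_batch_size=8,
)
```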
Thanks for answering. So if I pass some lr either to TrainingArguments' learning_rate or to the Trainer's optimizers argument, backprop actually occurs with lr / n_gpus. Is my understanding correct?
In that case, wouldn't it be less prone to confusion to call it (similarly to the batch size) learning_rate_per_device?
Not necessarily. It's a heuristic that people recommend, but it's also recommended to test it yourself, at your discretion.
What's really happening is that the number of times the learning rate scheduler gets stepped increases, so if you want the same LR schedule going from situation A to situation B you should try multiplying the learning rate.
However, again: test it yourself first. Sometimes it's not necessary.
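If you do want to try the linear-scaling heuristic with the Trainer, a minimal sketch could look like the following (the base values are placeholders; whether the scaling actually helps is something to verify on your own setup). Using warmup_ratio instead of warmup_steps also sidesteps the warmup-steps question, since the ratio is applied to whatever the total number of optimization steps turns out to be:

```python
import torch
from transformers import TrainingArguments

world_size = max(torch.cuda.device_count(), 1)

base_lr = 2e-5  # LR tuned in the single-GPU run

args = TrainingArguments(
    output_dir="out",
    learning_rate=base_lr * world_size,  # linear-scaling heuristic; test before trusting it
    per_device_train_batch_size=8,
    warmup_ratio=0.1,                    # warmup defined as a fraction of total steps
    lr_scheduler_type="linear",
)
```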