Learning rate for the `Trainer` in a multi-GPU setup

I'm training using the Trainer class on a multi-GPU setup.
I know that when using Accelerate (see "Comparing performance between different device setups"), in order to train with the desired learning rate we have to explicitly multiply it by the number of GPUs. A rough sketch of what I mean is below, after the questions.

  • Is that also the case when using the Trainer class?
  • In the case of warmup steps: should the same be applied, i.e. n_warmup_steps *= n_gpus?
  • In the case of a learning rate scheduler: should the same scaling be applied too?
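
Something like this is what I have in mind (the values are placeholders, and the model/dataset are omitted; the commented lines mark the scaling I'm unsure about):

```python
import torch
from transformers import Trainer, TrainingArguments

# Placeholder hyperparameters; the scaling below is the heuristic I'm asking about.
base_lr = 5e-5
base_warmup_steps = 100
n_gpus = max(torch.cuda.device_count(), 1)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    learning_rate=base_lr * n_gpus,           # multiply by the number of GPUs?
    warmup_steps=base_warmup_steps * n_gpus,  # and scale warmup steps the same way?
    lr_scheduler_type="linear",
)

# model and train_dataset omitted; the question is only about the arguments above.
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```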

Yes.

Regarding warmup steps, it may need to be; I'm unsure off the top of my head.

Thanks for answering. So if I pass some learning rate to either `TrainingArguments.learning_rate` or to the Trainer `optimizers` argument, backprop actually occurs with lr / n_gpus. Is my understanding correct?
In that case, wouldn't it be less prone to confusion to call it learning_rate_per_device, similarly to the batch size?

Not necessarily. It's a heuristic that people recommend, but it's also recommended that you test it yourself and use your own discretion.

What's really happening is that each learning-rate step now covers more data, since the effective batch size grows with the number of GPUs, so if you want the same effective LR going from situation A to situation B you should try multiplying the learning rate.

However, again: test it yourself first. Sometimes it's not necessary.
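
A quick back-of-the-envelope way to see why (the numbers below are made up): with more GPUs the effective batch size grows, so the scheduler gets through its warmup and decay in fewer optimizer steps.

```python
# Toy numbers, just to illustrate how the step count shrinks as GPUs are added.
dataset_size = 100_000
per_device_batch_size = 8
grad_accum_steps = 1

for n_gpus in (1, 4, 8):
    effective_batch = per_device_batch_size * n_gpus * grad_accum_steps
    steps_per_epoch = dataset_size // effective_batch
    print(f"{n_gpus} GPU(s): effective batch size {effective_batch}, "
          f"~{steps_per_epoch} optimizer steps per epoch")
```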
