Learning Rate Scheduler in Distributed Training

Apologies if this is a dumb question, but I’ve noticed something when using a learning rate scheduler on a single GPU versus multiple GPUs. I have a warmup period of N warmup steps, followed by a linear decay over the rest of the training steps. However, when I print the optimizer’s learning rate after each step, the single-GPU and multi-GPU runs differ: in the multi-GPU run the learning rate behaves as if the scheduler were called several times per step (once per GPU, judging by the numbers) instead of once.

For example, I start with a learning rate of 5e-5 with 16,000 warmup steps.

Here is the output for the single GPU case

Single GPU

  0%|                                                                                                                       | 0/200000 [00:00<?, ?it/s]schedule: 5e-05
  0%|                                                                                                           | 1/200000 [00:04<228:23:23,  4.11s/it]
schedule: 5.0003125000000004e-05
  0%|                                                                                                           | 2/200000 [00:05<126:52:17,  2.28s/it]
schedule: 5.000625e-05
  0%|                                                                                                            | 3/200000 [00:06<93:29:20,  1.68s/it]
schedule: 5.0009375e-05
  0%|                                                                                                            | 4/200000 [00:07<78:51:11,  1.42s/it]
schedule: 5.001250000000001e-05
  0%|                                                                                                            | 5/200000 [00:08<70:25:25,  1.27s/it]
schedule: 5.0015625e-05

And here is the output for the multi-GPU (8 GPUs) case:

0%|                                                                                                                       | 0/200000 [00:00<?, ?it/s]
schedule: 5e-05
  0%|                                                                                                          | 1/200000 [00:18<1001:27:26, 18.03s/it]
schedule: 5.0025e-05
  0%|                                                                                                           | 2/200000 [00:19<459:20:37,  8.27s/it]
schedule: 5.005e-05
  0%|                                                                                                           | 3/200000 [00:21<291:04:23,  5.24s/it]
schedule: 5.0075000000000004e-05
  0%|                                                                                                           | 4/200000 [00:22<211:53:17,  3.81s/it]
schedule: 5.0100000000000005e-05
  0%|                                                                                                           | 5/200000 [00:24<170:05:48,  3.06s/it]
schedule: 5.0125e-05
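
To sanity-check the numbers, here is a small sketch that only does arithmetic on the values printed above. The lists are copied from the two logs; the helper name `mean_increment` is just something I made up for illustration, and nothing here touches the actual scheduler or Accelerate.

```python
# Learning rates copied from the two logs above (steps 0 through 5).
single_gpu = [5e-05, 5.0003125000000004e-05, 5.000625e-05,
              5.0009375e-05, 5.001250000000001e-05, 5.0015625e-05]
multi_gpu = [5e-05, 5.0025e-05, 5.005e-05,
             5.0075000000000004e-05, 5.0100000000000005e-05, 5.0125e-05]


def mean_increment(values):
    """Average change in learning rate between consecutive printed steps."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


single_step = mean_increment(single_gpu)
multi_step = mean_increment(multi_gpu)

# Increment I would expect from one scheduler call: 5e-5 spread over 16,000 warmup steps.
expected_step = 5e-5 / 16000

ratio = multi_step / single_step

print(f"single-GPU increment per step: {single_step:.4e}")
print(f"multi-GPU increment per step:  {multi_step:.4e}")
print(f"expected increment per call:   {expected_step:.4e}")
print(f"multi / single ratio:          {ratio:.2f}")
```

The single-GPU increment matches 5e-5 / 16,000, and the multi-GPU increment is about 8 times larger, which is the number of GPUs in my run.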

I was under the impression that in the multi-GPU case, each backward pass only incurs one step, since the gradients are aggregated onto one GPU. Is this wrong? In a multi-GPU setting, does this mean that one “step” is really N steps, where N is the number of GPUs? I assumed that in the multi-GPU case the optimizer step is only called once (and handled by Accelerate), but it seems that the optimizer is instead stepped once for each device.

Sorry for the naive question, just getting a little hung up on the verbiage and confusing myself a little :upside_down_face: