Learning Rate Scheduler in Distributed Training

Apologies if this is a dumb question, but I've noticed something when using a learning rate scheduler on a single GPU vs. multiple GPUs. I have a warmup period of N warmup steps and then a linear decay over the rest of the training steps. However, if I print the optimizer's learning rate after each step in the single- vs. multi-GPU case, the learning rates differ, and the scheduler seems to act as if it were called once per GPU instead of once in total.

For example, I start with a learning rate of 5e-5 and 16,000 warmup steps.
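Roughly, my training loop looks like this (a simplified sketch with a toy model and made-up data; my real lambda comes from my own config, so the exact LR values differ, but the structure is the same):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Toy model and optimizer just to keep the sketch self-contained
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

NUM_WARMUP_STEPS = 16_000
NUM_TRAINING_STEPS = 200_000

def lr_lambda(step):
    # Simplified: linear warmup, then linear decay
    if step < NUM_WARMUP_STEPS:
        return step / NUM_WARMUP_STEPS
    return max(0.0, (NUM_TRAINING_STEPS - step) / (NUM_TRAINING_STEPS - NUM_WARMUP_STEPS))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)

for _ in range(5):
    x = torch.randn(8, 4, device=accelerator.device)
    loss = model(x).pow(2).mean()
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    # This is the value I'm printing after each step
    print("schedule:", scheduler.get_last_lr()[0])
```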

Here is the output for the single-GPU case:

Single GPU

  0%|                                                                                                                       | 0/200000 [00:00<?, ?it/s]
schedule: 5e-05
  0%|                                                                                                           | 1/200000 [00:04<228:23:23,  4.11s/it]
schedule: 5.0003125000000004e-05
  0%|                                                                                                           | 2/200000 [00:05<126:52:17,  2.28s/it]
schedule: 5.000625e-05
  0%|                                                                                                            | 3/200000 [00:06<93:29:20,  1.68s/it]
schedule: 5.0009375e-05
  0%|                                                                                                            | 4/200000 [00:07<78:51:11,  1.42s/it]
schedule: 5.001250000000001e-05
  0%|                                                                                                            | 5/200000 [00:08<70:25:25,  1.27s/it]
schedule: 5.0015625e-05

and here it is in the multi-GPU (8 GPUs) case:

0%|                                                                                                                       | 0/200000 [00:00<?, ?it/s]
schedule: 5e-05
  0%|                                                                                                          | 1/200000 [00:18<1001:27:26, 18.03s/it]
schedule: 5.0025e-05
  0%|                                                                                                           | 2/200000 [00:19<459:20:37,  8.27s/it]
schedule: 5.005e-05
  0%|                                                                                                           | 3/200000 [00:21<291:04:23,  5.24s/it]
schedule: 5.0075000000000004e-05
  0%|                                                                                                           | 4/200000 [00:22<211:53:17,  3.81s/it]
schedule: 5.0100000000000005e-05
  0%|                                                                                                           | 5/200000 [00:24<170:05:48,  3.06s/it]
schedule: 5.0125e-05

I was under the impression that in the multi-GPU case, each backward pass only incurs one step, since the gradients are aggregated across GPUs. Is this wrong? In a multi-GPU setting, does this mean that one "step" is really N steps? I assumed that in the multi-GPU case the optimizer step is only called once (and handled by Accelerate), but it seems that instead the scheduler is stepped once for each process.

Sorry for the naive question, just getting a little hung up on the verbiage and confusing myself a little :upside_down_face:


Bump. I have the same question, which I think could be interesting to many.

Yes, that's how Accelerate's schedulers work. I'd recommend reading here: Comparing performance between different device setups
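You can see it in your own numbers: each process steps the scheduler, so one training iteration advances it num_processes times:

```python
base_lr = 5e-5
num_warmup_steps = 16_000
increment = base_lr / num_warmup_steps  # 3.125e-09 per scheduler step

# Single GPU: the scheduler is stepped once per iteration
print(base_lr + 1 * increment)  # ~5.0003125e-05, matches your single-GPU log
# 8 GPUs: the scheduler is stepped once per process, i.e. 8 times per iteration
print(base_lr + 8 * increment)  # ~5.0025e-05, matches your multi-GPU log
```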

Thanks for the link @muellerzr. I must say, though, that to me at least this is unexpected behavior. If a user defines a learning rate schedule as a function of the step, it seems strange for that function to be influenced by a conceptually separate hyperparameter like num_processes.

It's also inconsistent with the philosophy in the doc you linked, which states:

we leave this up to the user to decide if they wish to scale their learning rate or not.

IMO the user should get to decide how many times they want their learning rate scheduler stepped per step :slight_smile:

For those looking for a workaround, the Accelerator flag step_scheduler_with_optimizer=False should do it, since it follows this conditional branch rather than the one influenced by num_processes.
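For example (assuming an otherwise default setup):

```python
from accelerate import Accelerator

# With this flag, the prepared scheduler advances exactly once per
# scheduler.step() call, instead of once per process.
accelerator = Accelerator(step_scheduler_with_optimizer=False)
```

Note that with this flag you then become responsible for sizing num_warmup_steps and num_training_steps in terms of per-process optimizer steps yourself.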

I think Accelerate is set up this way because it is convenient not to have to modify num_warmup_steps and num_training_steps for the number of processes. Did you manually scale num_warmup_steps and num_training_steps, leading to unexpected results? If you leave them as the totals for the whole run, training should behave as expected.
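For instance, with 8 GPUs the scheduler advances 8 steps per iteration, so a warmup defined over 16,000 total scheduler steps finishes after 2,000 iterations on each process, which corresponds to the same amount of data seen as 16,000 single-GPU steps:

```python
num_processes = 8
num_warmup_steps = 16_000  # defined over total scheduler steps, left unmodified

# With the default step_scheduler_with_optimizer=True, each training
# iteration advances the scheduler num_processes times, so warmup
# finishes after this many iterations on each process:
print(num_warmup_steps // num_processes)  # 2000
```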

I am not sure which exact arguments you are referring to; I am using a torch.optim.lr_scheduler.LambdaLR with warmup and cooldown phases defined by my own configuration. I am stepping this scheduler once per optimizer step, not 2x or 4x or 8x or <however many GPUs I happen to be using> times per optimizer step.

While it is true that the number of GPUs should influence one's choice of learning rate, warmup and cooldown rates, etc., the operative word here is choice. After all, Accelerate doesn't silently quadruple the learning rate if you decide to train with 4 GPUs! I think the same principle should apply to the learning rate scheduler.