Multi-GPU training does not optimize as expected

I am a bit confused about the behavior of multi-GPU training with accelerate. I am training GPT-2 small on the MNLI dataset using the run_glue.py example script. I expected the following two setups to be equivalent (rough launch commands are sketched after the list):

  1. Training with 1 GPU, batch size per device 16, gradient accumulation steps 8, for max_steps 200 (green curve in image)
  2. Training with 8 GPUs, batch size per device 16, gradient accumulation steps 1, for max_steps 200 (orange curve in image)
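
For reference, the two runs are launched roughly like this. The learning-rate value and output paths below are placeholders, and the remaining data/logging arguments are omitted:

```bash
# Condition 1: single GPU, per-device batch 16, gradient accumulation 8
python run_glue.py \
  --model_name_or_path gpt2 \
  --task_name mnli \
  --do_train \
  --per_device_train_batch_size 16 \
  --gradient_accumulation_steps 8 \
  --learning_rate 2e-5 \
  --lr_scheduler_type constant \
  --max_steps 200 \
  --output_dir out/single_gpu

# Condition 2: 8 GPUs via accelerate, per-device batch 16, no accumulation
accelerate launch --num_processes 8 run_glue.py \
  --model_name_or_path gpt2 \
  --task_name mnli \
  --do_train \
  --per_device_train_batch_size 16 \
  --gradient_accumulation_steps 1 \
  --learning_rate 2e-5 \
  --lr_scheduler_type constant \
  --max_steps 200 \
  --output_dir out/multi_gpu
```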

In both cases I am training with the same constant learning rate. However, the single-GPU setting (1) seems to perform significantly better than the multi-GPU setting (2). How can I configure the multi-GPU training (2) to be "equivalent" to the single-GPU setting (1)? Is there something I am missing about how these two settings behave? Any help is much appreciated!

Condition 1: 16 × 8 = 128 samples per GPU per optimizer step (all on one GPU).
Condition 2: 16 × 1 = 16 samples per GPU per optimizer step (128 total across 8 GPUs).
It seems the learning rate must be scaled by 8× for condition 2 to match.
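
Concretely, applying that note would mean launching the multi-GPU run with the base learning rate multiplied by 8; the base value of 2e-5 is again just a placeholder, so 8 × 2e-5 = 1.6e-4 here:

```bash
# Hypothetical: multi-GPU run with the base learning rate scaled by the number of GPUs (8x)
accelerate launch --num_processes 8 run_glue.py \
  --model_name_or_path gpt2 \
  --task_name mnli \
  --do_train \
  --per_device_train_batch_size 16 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1.6e-4 \
  --lr_scheduler_type constant \
  --max_steps 200 \
  --output_dir out/multi_gpu_lr8x
```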