I am a bit confused about the behavior of multi-GPU training with accelerate. I am training GPT-2 small on the MNLI dataset using the run_glue.py example script. I expected the following two settings to be equivalent:
- (1) Training with 1 GPU, per-device batch size 16, gradient accumulation steps 8, for max_steps 200 (green curve in the image)
- (2) Training with 8 GPUs, per-device batch size 16, gradient accumulation steps 1, for max_steps 200 (orange curve in the image)
In both cases I am training with the same constant learning rate. However, the single-GPU setting (1) seems to perform significantly better than the multi-GPU setting (2). How can I configure the multi-GPU training to be equivalent to the single-GPU setting? Is there something I am missing about how these two settings behave? Any help is much appreciated!
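For context, my expectation of equivalence comes from the effective (global) batch size, which works out the same in both settings. A quick sanity check (`effective_batch_size` is just an illustrative helper, not part of the script):

```python
def effective_batch_size(num_gpus: int, per_device_bs: int, grad_accum_steps: int) -> int:
    # Global batch size per optimizer step = GPUs * per-device batch * accumulation steps
    return num_gpus * per_device_bs * grad_accum_steps

single_gpu = effective_batch_size(num_gpus=1, per_device_bs=16, grad_accum_steps=8)  # setting (1)
multi_gpu = effective_batch_size(num_gpus=8, per_device_bs=16, grad_accum_steps=1)   # setting (2)

print(single_gpu, multi_gpu)  # → 128 128
```

So both runs should take 200 optimizer steps at a global batch size of 128, which is why I expected the curves to match.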