Multi-gpu training does not optimize as expected

nghosh · November 10, 2023, 2:59am

I am a bit confused about the behavior of multi-GPU training with accelerate. I am training GPT-2 small on the MNLI dataset using the run_glue.py example script. I expected the following two settings to be equivalent:

1. Training with 1 GPU, batch size per device 16, gradient accumulation steps 8, for max_steps 200 (green curve in image)
2. Training with 8 GPUs, batch size per device 16, gradient accumulation steps 1, for max_steps 200 (orange curve in image)

In both cases I am training with the same constant learning rate. However, the single-GPU setting (1) seems to perform significantly better than the multi-GPU setting (2). How can I set up the multi-GPU training to be "equivalent" to the single-GPU setting? Is there something I am missing about how these two settings behave? Any help is much appreciated!
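To make the expected equivalence concrete, here is a minimal sketch in plain Python rather than accelerate or run_glue.py. It assumes a toy one-parameter squared-error loss, that gradient accumulation scales each micro-batch gradient by 1/8, and that the 8 workers average their gradients. Those are assumptions about the setup, not something verified against the script. Under them, both settings yield the same gradient as one batch of 128 samples.

```python
import math
import random


def mean_grad(w, xs, ys):
    """Gradient w.r.t. w of the mean of 0.5 * (w * x - y) ** 2 over a batch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


random.seed(0)
xs = [random.gauss(0, 1) for _ in range(128)]
ys = [random.gauss(0, 1) for _ in range(128)]
w = 0.3  # hypothetical current parameter value

# Split the 128 samples into 8 micro-batches of 16.
chunks = [(xs[i:i + 16], ys[i:i + 16]) for i in range(0, 128, 16)]

# Setting 1 (assumed): one device accumulates 8 micro-batch gradients,
# each scaled by 1 / 8 before being added.
accumulated = 0.0
for cx, cy in chunks:
    accumulated += mean_grad(w, cx, cy) / len(chunks)

# Setting 2 (assumed): 8 workers each compute one micro-batch gradient,
# then the gradients are averaged across workers.
per_worker = [mean_grad(w, cx, cy) for cx, cy in chunks]
averaged = sum(per_worker) / len(per_worker)

# Reference: gradient of the full 128-sample batch.
full = mean_grad(w, xs, ys)

print(math.isclose(accumulated, full, abs_tol=1e-9))
print(math.isclose(averaged, full, abs_tol=1e-9))
```

If the real runs diverge, one of these assumptions may not hold in the actual training loop, for example how the loss is scaled during accumulation or how steps are counted per process.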

buaa42wxy · February 26, 2024, 3:44am

Condition 1: 16*8 samples per GPU per optimizer step.
Condition 2: 16*1 samples per GPU per optimizer step.
It seems the learning rate must be 8x.
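The arithmetic in this reply can be sketched as follows. The helper name `samples_per_optimizer_step` is hypothetical and only counts samples. It says nothing about how accelerate reduces gradients across processes.

```python
def samples_per_optimizer_step(num_gpus, per_device_batch, grad_accum_steps):
    """Return (samples per GPU, samples across all GPUs) for one optimizer step."""
    per_gpu = per_device_batch * grad_accum_steps
    return per_gpu, per_gpu * num_gpus


# Setting 1: 1 GPU, batch size 16, gradient accumulation steps 8.
single = samples_per_optimizer_step(num_gpus=1, per_device_batch=16, grad_accum_steps=8)

# Setting 2: 8 GPUs, batch size 16, gradient accumulation steps 1.
multi = samples_per_optimizer_step(num_gpus=8, per_device_batch=16, grad_accum_steps=1)

print(single)
print(multi)
```

The per-GPU counts differ by a factor of 8 (128 versus 16), while the totals across all GPUs match (128 in both settings). Whether that calls for a learning rate change depends on whether gradients are summed or averaged across GPUs, which this sketch does not check.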
