I am a bit confused about the behavior of multi-gpu training using accelerate. I am training gpt2 small on the MNLI dataset using the run_glue.py example script. I expected the following to be equivalent Training with 1 GPU, batch size per device 16, gradient accumulation steps 8, for max_steps 20…

Multi-gpu training does not optimize as expected

buaa42wxy February 26, 2024, 3:44am 2

condition 1: 16*8 per GPU
condition 2: 16*1 per GPU
seems learning rate must be 8x

Topic		Replies	Views
What does "--multi_gpu" do under the hood? (and how to use it) 🤗Accelerate	7	6911	May 31, 2023
Multiple GPUs do not speed up the training 🤗Accelerate	1	3493	January 26, 2022
Learning Rate Scheduler Distributed Training 🤗Accelerate	6	2527	September 5, 2024
Single GPU is faster than multiple GPUs 🤗Accelerate	3	2141	January 31, 2024
Learning rate for the `Trainer` in a multi gpu setup 🤗Transformers	4	715	April 29, 2024