Multiple GPUs do not speed up training

I am trying to train the bert-base-uncased model on Nvidia 3080 GPUs. Strangely, the time spent on one step grows linearly with the number of GPUs. For example, with the same batch size, if one step takes 2 s/iteration on a single GPU, two GPUs take around 4 s/iteration. I know some time is spent on synchronization, but I don’t think it can account for this much. As a result, the total time with multiple GPUs is about the same as with a single GPU, as if the GPUs were running one after another. I ran the sample code provided on this link without modification and the problem still occurs. BTW, I have run transformers.trainer with multiple GPUs on this machine, and distributed training works there.
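To make the comparison concrete, here is a small sketch of the arithmetic behind the timings quoted above. It assumes the batch size is applied per process, so that each step on N GPUs handles N times as many samples; I have not verified that this is how the sample script behaves. The function name and the batch size of 8 are made up for illustration; only the 2 s and 4 s step times come from my runs.

```python
def samples_per_second(per_device_batch_size, num_processes, seconds_per_step):
    """Throughput in samples/s, ASSUMING the batch size is applied per process."""
    # Under that assumption, one step consumes this many samples in total.
    effective_batch_size = per_device_batch_size * num_processes
    return effective_batch_size / seconds_per_step


# per_device_batch_size=8 is a hypothetical value chosen for illustration.
single_gpu = samples_per_second(per_device_batch_size=8, num_processes=1, seconds_per_step=2.0)
two_gpus = samples_per_second(per_device_batch_size=8, num_processes=2, seconds_per_step=4.0)

print(f"1 GPU : {single_gpu} samples/s")
print(f"2 GPUs: {two_gpus} samples/s")
```

Under this assumption both configurations come out at the same throughput, which matches the total training time being about the same. If the batch size were instead split across the GPUs, the two-GPU run would be half as fast as the single-GPU run.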

The CUDA version shown by nvidia-smi is 11.4 and the environment is:

  • transformers version: 4.11.3
  • Platform: Linux-5.11.0-38-generic-x86_64-with-debian-bullseye-sid
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.9.0+cu111 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

The accelerate config is:

In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: no
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: no

The relevant outputs are:

Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
10/28/2021 16:10:50 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Use FP16 precision: False

10/28/2021 16:10:50 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:3
Use FP16 precision: False

10/28/2021 16:10:50 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:2
Use FP16 precision: False

10/28/2021 16:10:50 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:1
Use FP16 precision: False

.........
# and in the training loop, these four lines occur:
10/28/2021 16:11:45 - INFO - root - Reducer buckets have been rebuilt in this iteration.
10/28/2021 16:11:45 - INFO - root - Reducer buckets have been rebuilt in this iteration.
10/28/2021 16:11:45 - INFO - root - Reducer buckets have been rebuilt in this iteration.
10/28/2021 16:11:45 - INFO - root - Reducer buckets have been rebuilt in this iteration.