Multiple GPUs do not speed up the training

ezio98 · October 28, 2021, 11:28am

I am trying to train the bert-base-uncased model on Nvidia 3080 GPUs. The strange thing is that the time spent on one step grows linearly with the number of GPUs. For example, with the same batch size, if one step takes 2 s/it on a single GPU, two GPUs need around 4 s/it. I know some time is spent on synchronization, but I don't think it should account for this much. As a result, the total training time with multiple GPUs is about the same as with a single GPU, as if the GPUs were running one after another. I ran the sample code provided on this link without modification and the problem still occurs. BTW, I have run the transformers Trainer using multiple GPUs on this machine, and distributed training works there.
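To make the arithmetic behind "the total time is similar" concrete, here is a minimal sketch. It assumes the batch size is per process, so each step consumes batch size times the number of processes samples; as far as I know this is how Accelerate behaves by default. The per-device batch size of 32 and the dataset size of 64000 are hypothetical numbers picked for illustration; only the 2 s/it and 4 s/it figures come from the post.

```python
import math


def throughput(per_device_batch, num_processes, sec_per_step):
    """Samples processed per second, summed over all processes."""
    return per_device_batch * num_processes / sec_per_step


def epoch_seconds(dataset_size, per_device_batch, num_processes, sec_per_step):
    """Wall-clock seconds for one pass over the dataset."""
    # Each step consumes one per-device batch on every process.
    steps = math.ceil(dataset_size / (per_device_batch * num_processes))
    return steps * sec_per_step


# Hypothetical sizes (not from the post); step times are the reported ones.
DATASET_SIZE = 64000
PER_DEVICE_BATCH = 32

one_gpu = epoch_seconds(DATASET_SIZE, PER_DEVICE_BATCH, 1, 2.0)
two_gpu_observed = epoch_seconds(DATASET_SIZE, PER_DEVICE_BATCH, 2, 4.0)
two_gpu_ideal = epoch_seconds(DATASET_SIZE, PER_DEVICE_BATCH, 2, 2.0)

print(one_gpu, two_gpu_observed, two_gpu_ideal)
```

With these assumptions, two GPUs halve the number of steps, but the doubled step time cancels that out, so the epoch takes as long as on one GPU. With good scaling the step time would stay near 2 s/it and the epoch time would roughly halve.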

The CUDA version shown by nvidia-smi is 11.4 and the environment is:

transformers version: 4.11.3
Platform: Linux-5.11.0-38-generic-x86_64-with-debian-bullseye-sid
Python version: 3.7.6
PyTorch version (GPU?): 1.9.0+cu111 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

The accelerate config is:

In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: no
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: no

The relevant outputs are:

Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
10/28/2021 16:10:50 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Use FP16 precision: False

10/28/2021 16:10:50 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:3
Use FP16 precision: False

10/28/2021 16:10:50 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:2
Use FP16 precision: False

10/28/2021 16:10:50 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:1
Use FP16 precision: False

.........
# and in the training loop, these four lines occur:
10/28/2021 16:11:45 - INFO - root - Reducer buckets have been rebuilt in this iteration.
10/28/2021 16:11:45 - INFO - root - Reducer buckets have been rebuilt in this iteration.
10/28/2021 16:11:45 - INFO - root - Reducer buckets have been rebuilt in this iteration.
10/28/2021 16:11:45 - INFO - root - Reducer buckets have been rebuilt in this iteration.
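Since the first iterations include one-off setup (the reducer bucket rebuild logged above appears to be one such step), per-step times are easier to compare if a few warm-up steps are skipped. Below is a rough sketch of a timing helper, not taken from the sample script. The names time_steps, step_fn and sync are hypothetical; for GPU work one would pass torch.cuda.synchronize as sync so that queued kernels are counted in the measurement.

```python
import time
from statistics import mean


def time_steps(step_fn, n_steps=20, warmup=3, sync=None):
    """Return the mean seconds per call of step_fn, ignoring warm-up calls.

    step_fn: hypothetical zero-argument callable running one training step.
    sync:    optional zero-argument callable that blocks until pending
             device work is done (e.g. torch.cuda.synchronize).
    """
    durations = []
    for i in range(warmup + n_steps):
        if sync is not None:
            sync()  # make sure earlier work does not leak into this step
        start = time.perf_counter()
        step_fn()
        if sync is not None:
            sync()  # wait for this step's device work to finish
        elapsed = time.perf_counter() - start
        if i >= warmup:
            durations.append(elapsed)
    return mean(durations)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a GPU.
    print(time_steps(lambda: sum(range(100_000)), n_steps=5, warmup=1))
```

Running the same helper with one process and with several gives directly comparable s/it numbers.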

bobbydylan · January 26, 2022, 10:16pm

@ezio98 did you ever figure this out?
