Why does the training time differ?

dmammfl · April 25, 2024, 8:41am

I have been running multi-node training of a llama-2 model on 2 different servers (2x 2 A100 GPUs), but the training time is quite different on each server (Server 1: 3m 55s, Server 2: 8m 35s).

Below are the profiling results for each server.
Server 1:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                       FullyShardedDataParallel.forward         2.30%        4.656s       127.63%      258.490s       5.784ms       0.000us         0.00%      251.164s       5.620ms         44689  
                                               aten::mm         2.35%        4.754s         3.11%        6.294s      56.399us       86.628s        38.84%       89.632s     803.119us        111605  
                                     record_param_comms         4.15%        8.411s         4.87%        9.863s      56.484us       72.085s        32.32%       85.420s     489.199us        174612  
       autograd::engine::evaluate_function: MmBackward0         0.12%     252.987ms        26.91%       54.493s       1.848ms       0.000us         0.00%       83.776s       2.842ms         29480  
                                            MmBackward0         0.61%        1.228s        26.78%       54.233s       1.840ms       0.000us         0.00%       83.759s       2.841ms         29480  
                                 c10d::_allgather_base_         0.21%     433.304ms         3.20%        6.482s      98.102us       0.000us         0.00%       68.538s       1.037ms         66069  
ncclDevKernel_AllGather_RING_LL(ncclDevComm*, unsign...         0.00%       0.000us         0.00%       0.000us       0.000us       65.671s        29.44%       65.671s     980.450us         66980  
                                           aten::linear         0.21%     419.606ms         2.82%        5.715s      87.863us       0.000us         0.00%       59.659s     917.226us         65043  
                                           aten::matmul         0.18%     366.029ms         2.45%        4.957s      79.256us       0.000us         0.00%       59.083s     944.646us         62545  
ampere_bf16_s16816gemm_bf16_128x256_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       40.341s        18.09%       40.341s       2.253ms         17905  
                  FullyShardedDataParallel._pre_forward       -19.39%  -39260507.000us        36.24%       73.386s       1.642ms       0.000us         0.00%       37.472s     838.501us         44689  
            FullyShardedDataParallel._pre_backward_hook        -0.26%  -531075.000us        18.14%       36.735s       1.736ms       0.000us         0.00%       32.868s       1.553ms         21164  
        FullyShardedDataParallel._pre_backward_prefetch        -8.80%  -17826299.000us        15.95%       32.300s       1.526ms       0.000us         0.00%       31.996s       1.512ms         21164  
autograd::engine::evaluate_function: UnsafeViewBackw...         0.23%     474.306ms        17.97%       36.399s       1.235ms       0.000us         0.00%       31.766s       1.078ms         29480  
                                              aten::mul         0.92%        1.863s         1.69%        3.426s      40.042us       19.177s         8.60%       25.270s     295.304us         85572  
ampere_bf16_s16816gemm_bf16_128x256_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       14.026s         6.29%       14.026s       2.030ms          6908  
ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       12.918s         5.79%       12.918s       4.341ms          2976  
ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       10.880s         4.88%       10.880s       3.804ms          2860  
                                            aten::copy_         0.47%     959.763ms         2.90%        5.872s      64.208us       10.213s         4.58%       10.626s     116.181us         91460  
                                       cudaLaunchKernel         1.70%        3.440s         1.71%        3.472s       6.110us       10.148s         4.55%       10.150s      17.863us        568233  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us        9.918s         4.45%        9.918s     228.448us         43416  
                                         cudaEventQuery         0.35%     716.219ms         0.37%     743.229ms       0.855us        9.232s         4.14%        9.232s      10.619us        869371  
      autograd::engine::evaluate_function: MulBackward0         0.07%     140.898ms         0.46%     931.662ms      41.356us       0.000us         0.00%        8.959s     397.675us         22528  
                                       c10d::allgather_         0.01%      19.201ms         0.08%     162.460ms     178.332us       0.000us         0.00%        8.768s       9.625ms           911  
                                         aten::_to_copy        -0.17%  -353326.000us         4.20%        8.505s     131.379us       0.000us         0.00%        8.191s     126.528us         64738  
                                              aten::add         0.34%     698.129ms         0.37%     757.529ms      21.063us        7.635s         3.42%        8.020s     222.991us         35965  
autograd::engine::evaluate_function: torch::autograd...         0.41%     820.610ms         6.27%       12.703s     644.450us       0.000us         0.00%        7.889s     400.205us         19712  
           FullyShardedDataParallel._post_backward_hook         3.25%        6.576s         5.82%       11.784s     597.832us       0.000us         0.00%        7.847s     398.068us         19712  
                                               aten::to         0.10%     212.361ms         4.23%        8.558s      55.795us       0.000us         0.00%        7.811s      50.923us        153383  
                                           MulBackward0         0.04%      87.512ms         0.33%     668.249ms      29.663us       0.000us         0.00%        7.489s     332.426us         22528  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 202.527s
Self CUDA time total: 223.061s

Server 2:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       FullyShardedDataParallel.forward        -0.43%  -1844789.000us       208.30%      885.801s      19.821ms       0.000us         0.00%      500.421s      11.198ms         44689
                                     record_param_comms         1.52%        6.454s         1.68%        7.152s      40.958us      361.939s        83.20%      375.689s       2.152ms        174612
                                 c10d::_allgather_base_         0.08%     350.456ms         1.09%        4.651s      70.398us       0.000us         0.00%      361.006s       5.464ms         66069
ncclDevKernel_AllGather_RING_LL(ncclDevComm*, unsign...         0.00%       0.000us         0.00%       0.000us       0.000us      350.904s        80.66%      350.904s       5.239ms         66980
                  FullyShardedDataParallel._pre_forward       -69.70%  -296420964.000us        76.09%      323.591s       7.241ms       0.000us         0.00%      192.696s       4.312ms         44689
            FullyShardedDataParallel._pre_backward_hook         0.35%        1.477s         2.99%       12.705s     600.306us       0.000us         0.00%      163.586s       7.729ms         21164
        FullyShardedDataParallel._pre_backward_prefetch         0.02%      76.093ms         2.61%       11.084s     523.720us       0.000us         0.00%      161.722s       7.641ms         21164
autograd::engine::evaluate_function: UnsafeViewBackw...         0.09%     385.364ms         2.93%       12.479s     423.313us       0.000us         0.00%      158.321s       5.370ms         29480
       autograd::engine::evaluate_function: MmBackward0         0.05%     199.283ms        39.97%      169.967s       5.766ms       0.000us         0.00%       59.259s       2.010ms         29480
                                            MmBackward0         0.26%        1.098s        39.92%      169.759s       5.758ms       0.000us         0.00%       58.860s       1.997ms         29480
                                               aten::mm         1.96%        8.330s         2.33%        9.912s      88.813us       33.727s         7.75%       47.740s     427.754us        111605
                                           aten::linear         0.11%     460.671ms         2.18%        9.250s     136.681us       0.000us         0.00%       31.575s     466.574us         67675
                                           aten::matmul         0.08%     328.676ms         1.95%        8.293s     132.587us       0.000us         0.00%       30.681s     490.537us         62545
                                         cudaEventQuery         0.07%     296.426ms         0.07%     302.343ms       0.382us       17.182s         3.95%       17.183s      21.728us        790796
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us       16.819s         3.87%       16.819s     450.564us         37329
                                              aten::mul         0.33%        1.403s         0.39%        1.673s      19.612us       11.156s         2.56%       12.922s     151.434us         85328
autograd::engine::evaluate_function: torch::autograd...         0.16%     688.163ms         2.24%        9.544s     484.149us       0.000us         0.00%       12.835s     651.124us         19712
                                        cudaEventRecord         0.03%     124.691ms         0.03%     125.320ms       0.182us       12.804s         2.94%       12.804s      18.544us        690436
           FullyShardedDataParallel._post_backward_hook         1.29%        5.474s         2.06%        8.766s     444.721us       0.000us         0.00%       12.694s     643.982us         19712
                                       cudaLaunchKernel         0.40%        1.720s         0.42%        1.778s       3.894us       11.144s         2.56%       11.151s      24.421us        456628
                            c10d::_reduce_scatter_base_         0.03%     113.112ms         0.31%        1.303s      66.099us       0.000us         0.00%        8.770s     444.894us         19712
                                            aten::copy_         0.35%        1.468s         2.26%        9.630s     105.287us        7.562s         1.74%        8.590s      93.922us         91460
                                    cudaStreamWaitEvent         0.03%     135.826ms         0.03%     135.887ms       0.301us        7.911s         1.82%        7.912s      17.507us        451917
                                   cudaFuncSetAttribute         0.01%      26.438ms         0.01%      26.655ms       0.083us        6.952s         1.60%        6.952s      21.665us        320913
                                         aten::_to_copy        -0.17%  -717446.000us         2.70%       11.486s     177.421us       0.000us         0.00%        6.829s     105.480us         64738
                                       c10d::broadcast_         0.00%       5.114ms         0.34%        1.452s       2.626ms       0.000us         0.00%        6.630s      11.989ms           553
                                               aten::to         0.16%     691.999ms         2.71%       11.527s      75.149us       0.000us         0.00%        6.174s      40.251us        153383
ncclDevKernel_Broadcast_RING_LL(ncclDevComm*, unsign...         0.00%       0.000us         0.00%       0.000us       0.000us        5.629s         1.29%        5.629s      10.180ms           553
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us        5.590s         1.29%        5.590s     336.127us         16632
      autograd::engine::evaluate_function: MulBackward0         0.02%      88.979ms         0.15%     641.422ms      28.472us       0.000us         0.00%        5.408s     240.064us         22528
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 425.255s
Self CUDA time total: 435.036s

It seems that GPU bandwidth affects training time, but when I run the nvbandwidth test (GitHub - NVIDIA/nvbandwidth: A tool for bandwidth measurements on NVIDIA GPUs), Server 2 is faster than Server 1.
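To make the comparison concrete, here is a small sketch that only reuses numbers copied from the two profiler tables above (the Self CUDA column for the `ncclDevKernel_AllGather_RING_LL` and `aten::mm` rows, plus the "Self CUDA time total" lines). The dictionary keys and the helper name `summarize` are made up for illustration; nothing here is measured again.

```python
# Values copied from the profiler tables above (Self CUDA column, in seconds).
# Key names are hypothetical labels chosen for this sketch.
profiles = {
    "Server 1": {
        "allgather_s": 65.671,     # ncclDevKernel_AllGather_RING_LL
        "allgather_calls": 66980,
        "mm_s": 86.628,            # aten::mm
        "cuda_total_s": 223.061,   # Self CUDA time total
    },
    "Server 2": {
        "allgather_s": 350.904,
        "allgather_calls": 66980,
        "mm_s": 33.727,
        "cuda_total_s": 435.036,
    },
}


def summarize(profile):
    """Return (all-gather share of CUDA time, average all-gather kernel time in ms)."""
    share = profile["allgather_s"] / profile["cuda_total_s"]
    avg_ms = profile["allgather_s"] / profile["allgather_calls"] * 1e3
    return share, avg_ms


for name, profile in profiles.items():
    share, avg_ms = summarize(profile)
    print(f"{name}: all-gather share = {share:.1%}, avg kernel = {avg_ms:.3f} ms")

# Same number of all-gather calls on both servers, so the ratio of total
# kernel time is also the ratio of per-call time.
allgather_ratio = profiles["Server 2"]["allgather_s"] / profiles["Server 1"]["allgather_s"]
mm_ratio = profiles["Server 2"]["mm_s"] / profiles["Server 1"]["mm_s"]
print(f"all-gather time, Server 2 / Server 1: {allgather_ratio:.2f}x")
print(f"aten::mm time,   Server 2 / Server 1: {mm_ratio:.2f}x")
```

On these numbers, the matrix multiplications are actually cheaper on Server 2, while the all-gather kernel takes about five times longer per call and makes up roughly 81% of CUDA time there (versus roughly 29% on Server 1). So the extra time appears to sit in the collective communication rather than in compute.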

What causes the difference in training time between the two servers?
I would appreciate your insights. Thank you.

JonathanZal · June 25, 2024, 2:32pm

Hi there! I’m facing the same issue. Have you discovered any solutions?
