I have been running multi-node training of a Llama-2 model on two different servers (2 × 2 A100 GPUs), but the training time is quite different on each server (Server 1: 3m 55s, Server 2: 8m 35s).
Below are the profiling results for each server.
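For context, the tables were generated with torch.profiler and key_averages(). A minimal sketch of an equivalent setup (a small toy model stands in for the actual Llama-2 training loop; launched with torchrun) looks roughly like this:

```python
# Rough sketch only: a tiny model stands in for the real Llama-2 setup;
# the profiler usage here is just an example of how such tables can be produced.
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.profiler import profile, ProfilerActivity

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Toy stand-in for the Llama-2 model (placeholder sizes).
model = FSDP(nn.Sequential(nn.Linear(4096, 4096), nn.GELU(),
                           nn.Linear(4096, 4096)).cuda())
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):                        # a few training steps
        x = torch.randn(8, 4096, device="cuda")
        loss = model(x).sum()
        loss.backward()
        optim.step()
        optim.zero_grad(set_to_none=True)

if dist.get_rank() == 0:
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=36))
dist.destroy_process_group()
```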
Server 1:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
FullyShardedDataParallel.forward 2.30% 4.656s 127.63% 258.490s 5.784ms 0.000us 0.00% 251.164s 5.620ms 44689
aten::mm 2.35% 4.754s 3.11% 6.294s 56.399us 86.628s 38.84% 89.632s 803.119us 111605
record_param_comms 4.15% 8.411s 4.87% 9.863s 56.484us 72.085s 32.32% 85.420s 489.199us 174612
autograd::engine::evaluate_function: MmBackward0 0.12% 252.987ms 26.91% 54.493s 1.848ms 0.000us 0.00% 83.776s 2.842ms 29480
MmBackward0 0.61% 1.228s 26.78% 54.233s 1.840ms 0.000us 0.00% 83.759s 2.841ms 29480
c10d::_allgather_base_ 0.21% 433.304ms 3.20% 6.482s 98.102us 0.000us 0.00% 68.538s 1.037ms 66069
ncclDevKernel_AllGather_RING_LL(ncclDevComm*, unsign... 0.00% 0.000us 0.00% 0.000us 0.000us 65.671s 29.44% 65.671s 980.450us 66980
aten::linear 0.21% 419.606ms 2.82% 5.715s 87.863us 0.000us 0.00% 59.659s 917.226us 65043
aten::matmul 0.18% 366.029ms 2.45% 4.957s 79.256us 0.000us 0.00% 59.083s 944.646us 62545
ampere_bf16_s16816gemm_bf16_128x256_ldg8_f2f_stages_... 0.00% 0.000us 0.00% 0.000us 0.000us 40.341s 18.09% 40.341s 2.253ms 17905
FullyShardedDataParallel._pre_forward -19.39% -39260507.000us 36.24% 73.386s 1.642ms 0.000us 0.00% 37.472s 838.501us 44689
FullyShardedDataParallel._pre_backward_hook -0.26% -531075.000us 18.14% 36.735s 1.736ms 0.000us 0.00% 32.868s 1.553ms 21164
FullyShardedDataParallel._pre_backward_prefetch -8.80% -17826299.000us 15.95% 32.300s 1.526ms 0.000us 0.00% 31.996s 1.512ms 21164
autograd::engine::evaluate_function: UnsafeViewBackw... 0.23% 474.306ms 17.97% 36.399s 1.235ms 0.000us 0.00% 31.766s 1.078ms 29480
aten::mul 0.92% 1.863s 1.69% 3.426s 40.042us 19.177s 8.60% 25.270s 295.304us 85572
ampere_bf16_s16816gemm_bf16_128x256_ldg8_f2f_stages_... 0.00% 0.000us 0.00% 0.000us 0.000us 14.026s 6.29% 14.026s 2.030ms 6908
ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_... 0.00% 0.000us 0.00% 0.000us 0.000us 12.918s 5.79% 12.918s 4.341ms 2976
ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_... 0.00% 0.000us 0.00% 0.000us 0.000us 10.880s 4.88% 10.880s 3.804ms 2860
aten::copy_ 0.47% 959.763ms 2.90% 5.872s 64.208us 10.213s 4.58% 10.626s 116.181us 91460
cudaLaunchKernel 1.70% 3.440s 1.71% 3.472s 6.110us 10.148s 4.55% 10.150s 17.863us 568233
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 9.918s 4.45% 9.918s 228.448us 43416
cudaEventQuery 0.35% 716.219ms 0.37% 743.229ms 0.855us 9.232s 4.14% 9.232s 10.619us 869371
autograd::engine::evaluate_function: MulBackward0 0.07% 140.898ms 0.46% 931.662ms 41.356us 0.000us 0.00% 8.959s 397.675us 22528
c10d::allgather_ 0.01% 19.201ms 0.08% 162.460ms 178.332us 0.000us 0.00% 8.768s 9.625ms 911
aten::_to_copy -0.17% -353326.000us 4.20% 8.505s 131.379us 0.000us 0.00% 8.191s 126.528us 64738
aten::add 0.34% 698.129ms 0.37% 757.529ms 21.063us 7.635s 3.42% 8.020s 222.991us 35965
autograd::engine::evaluate_function: torch::autograd... 0.41% 820.610ms 6.27% 12.703s 644.450us 0.000us 0.00% 7.889s 400.205us 19712
FullyShardedDataParallel._post_backward_hook 3.25% 6.576s 5.82% 11.784s 597.832us 0.000us 0.00% 7.847s 398.068us 19712
aten::to 0.10% 212.361ms 4.23% 8.558s 55.795us 0.000us 0.00% 7.811s 50.923us 153383
MulBackward0 0.04% 87.512ms 0.33% 668.249ms 29.663us 0.000us 0.00% 7.489s 332.426us 22528
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 202.527s
Self CUDA time total: 223.061s
Server 2:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
FullyShardedDataParallel.forward -0.43% -1844789.000us 208.30% 885.801s 19.821ms 0.000us 0.00% 500.421s 11.198ms 44689
record_param_comms 1.52% 6.454s 1.68% 7.152s 40.958us 361.939s 83.20% 375.689s 2.152ms 174612
c10d::_allgather_base_ 0.08% 350.456ms 1.09% 4.651s 70.398us 0.000us 0.00% 361.006s 5.464ms 66069
ncclDevKernel_AllGather_RING_LL(ncclDevComm*, unsign... 0.00% 0.000us 0.00% 0.000us 0.000us 350.904s 80.66% 350.904s 5.239ms 66980
FullyShardedDataParallel._pre_forward -69.70% -296420964.000us 76.09% 323.591s 7.241ms 0.000us 0.00% 192.696s 4.312ms 44689
FullyShardedDataParallel._pre_backward_hook 0.35% 1.477s 2.99% 12.705s 600.306us 0.000us 0.00% 163.586s 7.729ms 21164
FullyShardedDataParallel._pre_backward_prefetch 0.02% 76.093ms 2.61% 11.084s 523.720us 0.000us 0.00% 161.722s 7.641ms 21164
autograd::engine::evaluate_function: UnsafeViewBackw... 0.09% 385.364ms 2.93% 12.479s 423.313us 0.000us 0.00% 158.321s 5.370ms 29480
autograd::engine::evaluate_function: MmBackward0 0.05% 199.283ms 39.97% 169.967s 5.766ms 0.000us 0.00% 59.259s 2.010ms 29480
MmBackward0 0.26% 1.098s 39.92% 169.759s 5.758ms 0.000us 0.00% 58.860s 1.997ms 29480
aten::mm 1.96% 8.330s 2.33% 9.912s 88.813us 33.727s 7.75% 47.740s 427.754us 111605
aten::linear 0.11% 460.671ms 2.18% 9.250s 136.681us 0.000us 0.00% 31.575s 466.574us 67675
aten::matmul 0.08% 328.676ms 1.95% 8.293s 132.587us 0.000us 0.00% 30.681s 490.537us 62545
cudaEventQuery 0.07% 296.426ms 0.07% 302.343ms 0.382us 17.182s 3.95% 17.183s 21.728us 790796
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128... 0.00% 0.000us 0.00% 0.000us 0.000us 16.819s 3.87% 16.819s 450.564us 37329
aten::mul 0.33% 1.403s 0.39% 1.673s 19.612us 11.156s 2.56% 12.922s 151.434us 85328
autograd::engine::evaluate_function: torch::autograd... 0.16% 688.163ms 2.24% 9.544s 484.149us 0.000us 0.00% 12.835s 651.124us 19712
cudaEventRecord 0.03% 124.691ms 0.03% 125.320ms 0.182us 12.804s 2.94% 12.804s 18.544us 690436
FullyShardedDataParallel._post_backward_hook 1.29% 5.474s 2.06% 8.766s 444.721us 0.000us 0.00% 12.694s 643.982us 19712
cudaLaunchKernel 0.40% 1.720s 0.42% 1.778s 3.894us 11.144s 2.56% 11.151s 24.421us 456628
c10d::_reduce_scatter_base_ 0.03% 113.112ms 0.31% 1.303s 66.099us 0.000us 0.00% 8.770s 444.894us 19712
aten::copy_ 0.35% 1.468s 2.26% 9.630s 105.287us 7.562s 1.74% 8.590s 93.922us 91460
cudaStreamWaitEvent 0.03% 135.826ms 0.03% 135.887ms 0.301us 7.911s 1.82% 7.912s 17.507us 451917
cudaFuncSetAttribute 0.01% 26.438ms 0.01% 26.655ms 0.083us 6.952s 1.60% 6.952s 21.665us 320913
aten::_to_copy -0.17% -717446.000us 2.70% 11.486s 177.421us 0.000us 0.00% 6.829s 105.480us 64738
c10d::broadcast_ 0.00% 5.114ms 0.34% 1.452s 2.626ms 0.000us 0.00% 6.630s 11.989ms 553
aten::to 0.16% 691.999ms 2.71% 11.527s 75.149us 0.000us 0.00% 6.174s 40.251us 153383
ncclDevKernel_Broadcast_RING_LL(ncclDevComm*, unsign... 0.00% 0.000us 0.00% 0.000us 0.000us 5.629s 1.29% 5.629s 10.180ms 553
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128... 0.00% 0.000us 0.00% 0.000us 0.000us 5.590s 1.29% 5.590s 336.127us 16632
autograd::engine::evaluate_function: MulBackward0 0.02% 88.979ms 0.15% 641.422ms 28.472us 0.000us 0.00% 5.408s 240.064us 22528
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 425.255s
Self CUDA time total: 435.036s
It seems that GPU bandwidth affects the training time, but when I run the nvbandwidth test (GitHub: NVIDIA/nvbandwidth, a tool for bandwidth measurements on NVIDIA GPUs), Server 2 is actually faster than Server 1.
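Since most of Server 2's CUDA time sits in ncclDevKernel_AllGather, I assume the number that matters is the NCCL all-gather bandwidth over the actual interconnect, not the copy-engine bandwidth that nvbandwidth reports. A rough sketch for timing that path directly (buffer size and iteration count are arbitrary choices; launched with torchrun across both servers):

```python
# Rough sketch: time the NCCL all-gather path that FSDP uses, as a
# complement to nvbandwidth (buffer size / iteration counts are arbitrary).
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

world = dist.get_world_size()
shard = torch.randn(16 * 1024 * 1024, device="cuda")           # 64 MiB fp32 per rank
out = torch.empty(shard.numel() * world, device="cuda")

for _ in range(5):                                              # warm-up
    dist.all_gather_into_tensor(out, shard)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_gather_into_tensor(out, shard)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - t0) / iters

gb = out.numel() * out.element_size() / 1e9
if dist.get_rank() == 0:
    print(f"all_gather_into_tensor: {elapsed * 1e3:.2f} ms/iter, ~{gb / elapsed:.1f} GB/s")
dist.destroy_process_group()
```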
What is causing the difference in training time between the two servers?
I would appreciate your insights. Thank you.