Troubleshooting help? Everything just hangs

Hi. I ran accelerate config on two machines, and the port I chose is open (I can telnet to it from elsewhere).
Any time I try to run anything, it just hangs: my own code hangs, and so does accelerate test.

$ accelerate test

Running:  accelerate-launch --config_file=None /home/ubuntu/envs/kat/lib/python3.9/site-packages/accelerate/test_utils/test_script.py

(nothing else happens on either machine)

…and then everything just hangs. How can I troubleshoot this?
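One generic thing worth ruling out first (not Accelerate-specific): a rendezvous hang looks exactly like this when some node can't actually reach the main process's host:port, so it's worth checking TCP reachability from every node, including the main node itself. A minimal sketch; the IP and port below are the ones from my config:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from *every* node while the main process is listening, e.g.:
# print(can_reach("172.31.206.96", 2346))
```

Telnet succeeding from one machine doesn't guarantee the training nodes themselves can connect (security groups can differ per source), which is why checking from each node matters.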

Node 0:

$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
What is the rank of this machine (from 0 to the number of machines - 1 )? [0]: 
What is the IP address of the machine that will host the main process? 172.31.206.96
What is the port you will use to communicate with the main process? 2346
Do you want to use DeepSpeed? [yes/NO]: 
Do you want to use FullyShardedDataParallel? [yes/NO]: 
How many GPU(s) should be used for distributed training? [1]:16
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: bf16

Node 1:

$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
What is the rank of this machine (from 0 to the number of machines - 1 )? [0]: 1
What is the IP address of the machine that will host the main process? 172.31.206.96
What is the port you will use to communicate with the main process? 2346
Do you want to use DeepSpeed? [yes/NO]: 
Do you want to use FullyShardedDataParallel? [yes/NO]: 
How many GPU(s) should be used for distributed training? [1]:16
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: bf16

I tried setting export NCCL_DEBUG=INFO but didn't see any messages printed. As I said, everything just hangs. (Note: after checking ifconfig, I set NCCL_SOCKET_IFNAME=ens32.)
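For anyone else debugging this, here's the full set of NCCL environment variables I've been experimenting with (these need to be exported on every node before launching; the interface name ens32 is specific to my instances, so check yours):

```shell
# Print NCCL's internal logging (INFO is verbose; WARN is quieter).
export NCCL_DEBUG=INFO
# Restrict logging to the initialization and network-setup phases,
# which is where multi-node hangs usually occur.
export NCCL_DEBUG_SUBSYS=INIT,NET
# Pin NCCL's bootstrap traffic to the right network interface;
# check `ip addr` or `ifconfig` for the name on your machines.
export NCCL_SOCKET_IFNAME=ens32
```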

What does one do in this situation to move forward?
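One way to narrow it down (a sketch; your_script.py is a placeholder for any script that calls torch.distributed) is to take accelerate launch out of the picture and drive the same rendezvous directly with torchrun, which uses the same master address/port mechanism. If this also hangs, the problem is at the PyTorch/NCCL layer rather than in Accelerate:

```shell
# On node 0 (the main process host):
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 \
    --master_addr=172.31.206.96 --master_port=2346 your_script.py

# On node 1:
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=8 \
    --master_addr=172.31.206.96 --master_port=2346 your_script.py
```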

Update: I did eventually get some output from export NCCL_DEBUG=INFO. Here's what the rank 0 process (on host "harmonai2") reports:

harmonai2:40680:40680 [0] NCCL INFO Bootstrap : Using ens32:172.31.206.96<0>
harmonai2:40680:40680 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
harmonai2:40680:40680 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.1.4aws
harmonai2:40680:40680 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
harmonai2:40680:40680 [0] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
harmonai2:40680:40680 [0] NCCL INFO NET/OFI Selected Provider is efa
harmonai2:40680:40680 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.5
harmonai2:40684:40684 [4] NCCL INFO Bootstrap : Using ens32:172.31.206.96<0>
harmonai2:40683:40683 [3] NCCL INFO Bootstrap : Using ens32:172.31.206.96<0>
harmonai2:40685:40685 [5] NCCL INFO Bootstrap : Using ens32:172.31.206.96<0>
harmonai2:40684:40684 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
harmonai2:40684:40684 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.1.4aws
harmonai2:40684:40684 [4] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
harmonai2:40684:40684 [4] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
harmonai2:40683:40683 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
harmonai2:40683:40683 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.1.4aws
harmonai2:40683:40683 [3] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
harmonai2:40683:40683 [3] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
harmonai2:40685:40685 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
harmonai2:40685:40685 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.1.4aws
harmonai2:40685:40685 [5] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
harmonai2:40685:40685 [5] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
harmonai2:40686:40686 [6] NCCL INFO Bootstrap : Using ens32:172.31.206.96<0>
harmonai2:40681:40681 [1] NCCL INFO Bootstrap : Using ens32:172.31.206.96<0>
harmonai2:40681:40681 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
harmonai2:40681:40681 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.1.4aws
harmonai2:40686:40686 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
harmonai2:40686:40686 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.1.4aws
harmonai2:40681:40681 [1] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
harmonai2:40686:40686 [6] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
harmonai2:40681:40681 [1] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
harmonai2:40686:40686 [6] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
harmonai2:40684:40684 [4] NCCL INFO NET/OFI Selected Provider is efa
harmonai2:40684:40684 [4] NCCL INFO Using network AWS Libfabric
harmonai2:40683:40683 [3] NCCL INFO NET/OFI Selected Provider is efa
harmonai2:40683:40683 [3] NCCL INFO Using network AWS Libfabric
harmonai2:40685:40685 [5] NCCL INFO NET/OFI Selected Provider is efa
harmonai2:40685:40685 [5] NCCL INFO Using network AWS Libfabric
harmonai2:40681:40681 [1] NCCL INFO NET/OFI Selected Provider is efa
harmonai2:40681:40681 [1] NCCL INFO Using network AWS Libfabric
harmonai2:40686:40686 [6] NCCL INFO NET/OFI Selected Provider is efa
harmonai2:40686:40686 [6] NCCL INFO Using network AWS Libfabric
harmonai2:40682:40682 [2] NCCL INFO Bootstrap : Using ens32:172.31.206.96<0>
harmonai2:40682:40682 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
harmonai2:40682:40682 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.1.4aws
harmonai2:40682:40682 [2] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
harmonai2:40682:40682 [2] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
harmonai2:40682:40682 [2] NCCL INFO NET/OFI Selected Provider is efa
harmonai2:40682:40682 [2] NCCL INFO Using network AWS Libfabric
harmonai2:40687:40687 [7] NCCL INFO Bootstrap : Using ens32:172.31.206.96<0>
harmonai2:40687:40687 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
harmonai2:40687:40687 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.1.4aws
harmonai2:40687:40687 [7] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /opt/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
harmonai2:40687:40687 [7] NCCL INFO NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
harmonai2:40687:40687 [7] NCCL INFO NET/OFI Selected Provider is efa
harmonai2:40687:40687 [7] NCCL INFO Using network AWS Libfabric
harmonai2:40687:50365 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] 0/-1/-1->7->6 [2] 0/-1/-1->7->6 [3] 0/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] 0/-1/-1->7->6 [6] 0/-1/-1->7->6 [7] 0/-1/-1->7->6
harmonai2:40687:50365 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
harmonai2:40680:50356 [0] NCCL INFO Channel 00/08 :    0   7   6   5   4   3   2   1   8  15  14  13  12  11  10   9
harmonai2:40680:50356 [0] NCCL INFO Channel 01/08 :    0   3  10  15  14  13  12   9   8  11   2   7   6   5   4   1
harmonai2:40680:50356 [0] NCCL INFO Channel 02/08 :    0   7   6   5  12  11  10   9   8  15  14  13   4   3   2   1
harmonai2:40680:50356 [0] NCCL INFO Channel 03/08 :    0   5   4   7  14  11  10   9   8  13  12  15   6   3   2   1
harmonai2:40680:50356 [0] NCCL INFO Channel 04/08 :    0   7   6   5   4   3   2   1   8  15  14  13  12  11  10   9
harmonai2:40680:50356 [0] NCCL INFO Channel 05/08 :    0   3  10  15  14  13  12   9   8  11   2   7   6   5   4   1
harmonai2:40680:50356 [0] NCCL INFO Channel 06/08 :    0   7   6   5  12  11  10   9   8  15  14  13   4   3   2   1
harmonai2:40680:50356 [0] NCCL INFO Channel 07/08 :    0   5   4   7  14  11  10   9   8  13  12  15   6   3   2   1
harmonai2:40680:50356 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->7 [2] 1/-1/-1->0->7 [3] 1/-1/-1->0->7 [4] 1/-1/-1->0->8 [5] 1/-1/-1->0->7 [6] 1/-1/-1->0->7 [7] 1/-1/-1->0->7
harmonai2:40680:50356 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
harmonai2:40681:50360 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0
harmonai2:40681:50360 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
harmonai2:40682:50362 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/10/-1->2->-1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->10 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1
harmonai2:40682:50362 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
harmonai2:40683:50358 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] -1/-1/-1->3->2 [7] 4/-1/-1->3->2
harmonai2:40683:50358 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
harmonai2:40684:50357 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/12/-1->4->-1 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->12 [7] 5/-1/-1->4->3
harmonai2:40684:50357 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
harmonai2:40685:50359 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] -1/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] -1/-1/-1->5->4
harmonai2:40685:50359 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
harmonai2:40686:50361 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/14/-1->6->-1 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->14
harmonai2:40686:50361 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
harmonai2:40680:50356 [0] NCCL INFO Channel 01 : 0[101c0] -> 3[201d0] via P2P/IPC/read
harmonai2:40682:50362 [2] NCCL INFO Channel 01 : 2[201c0] -> 7[a01d0] via P2P/IPC/read
harmonai2:40681:50360 [1] NCCL INFO Channel 00 : 1[101d0] -> 8[101c0] [send] via NET/AWS Libfabric/0/GDRDMA
harmonai2:40680:50356 [0] NCCL INFO Channel 05 : 0[101c0] -> 3[201d0] via P2P/IPC/read
harmonai2:40684:50357 [4] NCCL INFO Channel 03 : 4[901c0] -> 7[a01d0] via P2P/IPC/read
harmonai2:40681:50360 [1] NCCL INFO Channel 04 : 1[101d0] -> 8[101c0] [send] via NET/AWS Libfabric/0/GDRDMA
harmonai2:40682:50362 [2] NCCL INFO Channel 05 : 2[201c0] -> 7[a01d0] via P2P/IPC/read
harmonai2:40686:50361 [6] NCCL INFO Channel 03 : 15[a01d0] -> 6[a01c0] [receive] via NET/AWS Libfabric/3/GDRDMA
harmonai2:40684:50357 [4] NCCL INFO Channel 07 : 4[901c0] -> 7[a01d0] via P2P/IPC/read
harmonai2:40683:50358 [3] NCCL INFO Channel 01 : 3[201d0] -> 10[201c0] [send] via NET/AWS Libfabric/1/GDRDMA
harmonai2:40684:50357 [4] NCCL INFO Channel 02 : 13[901d0] -> 4[901c0] [receive] via NET/AWS Libfabric/2/GDRDMA
harmonai2:40680:50356 [0] NCCL INFO Channel 03 : 0[101c0] -> 5[901d0] via P2P/IPC/read
harmonai2:40683:50358 [3] NCCL INFO Channel 05 : 3[201d0] -> 10[201c0] [send] via NET/AWS Libfabric/1/GDRDMA
harmonai2:40680:50356 [0] NCCL INFO Channel 07 : 0[101c0] -> 5[901d0] via P2P/IPC/read
harmonai2:40680:50356 [0] NCCL INFO Channel 00 : 9[101d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0/GDRDMA
harmonai2:40682:50362 [2] NCCL INFO Channel 01 : 11[201d0] -> 2[201c0] [receive] via NET/AWS Libfabric/1/GDRDMA
harmonai2:40685:50359 [5] NCCL INFO Channel 02 : 5[901d0] -> 12[901c0] [send] via NET/AWS Libfabric/2/GDRDMA
harmonai2:40684:50357 [4] NCCL INFO Channel 06 : 13[901d0] -> 4[901c0] [receive] via NET/AWS Libfabric/2/GDRDMA
harmonai2:40685:50359 [5] NCCL INFO Channel 06 : 5[901d0] -> 12[901c0] [send] via NET/AWS Libfabric/2/GDRDMA
harmonai2:40687:50365 [7] NCCL INFO Channel 03 : 7[a01d0] -> 14[a01c0] [send] via NET/AWS Libfabric/3/GDRDMA
harmonai2:40687:50365 [7] NCCL INFO Channel 07 : 7[a01d0] -> 14[a01c0] [send] via NET/AWS Libfabric/3/GDRDMA
harmonai2:40680:50356 [0] NCCL INFO Channel 04 : 9[101d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0/GDRDMA
harmonai2:40682:50362 [2] NCCL INFO Channel 05 : 11[201d0] -> 2[201c0] [receive] via NET/AWS Libfabric/1/GDRDMA
harmonai2:40680:50356 [0] NCCL INFO Channel 00 : 0[101c0] -> 7[a01d0] via P2P/IPC/read
harmonai2:40680:50356 [0] NCCL INFO Channel 02 : 0[101c0] -> 7[a01d0] via P2P/IPC/read
harmonai2:40680:50356 [0] NCCL INFO Channel 04 : 0[101c0] -> 7[a01d0] via P2P/IPC/read
harmonai2:40680:50356 [0] NCCL INFO Channel 06 : 0[101c0] -> 7[a01d0] via P2P/IPC/read
harmonai2:40686:50361 [6] NCCL INFO Channel 07 : 15[a01d0] -> 6[a01c0] [receive] via NET/AWS Libfabric/3/GDRDMA

Nothing more after that; it just sits there frozen. Eventually I pressed Ctrl-C.
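One thing that might show exactly where it's frozen (assuming you can pip install on the nodes): py-spy can dump the Python stack of a hung rank without killing it. <PID> below is a placeholder for one of the hung worker processes:

```shell
pip install py-spy
# Dump the current Python stack traces of a hung worker:
py-spy dump --pid <PID>
```

If every rank is parked inside the same collective call, that points at NCCL transport rather than the Python-side setup.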

Hello, can you try the approaches suggested in Multi-node training hangs at accelerator.prepare(model) · Issue #412 · huggingface/accelerate (github.com)?