Accelerator.prepare hangs on a single machine with multiple GPUs

When I run:

import torch.nn as nn
from accelerate import Accelerator

if __name__ == "__main__":
    accelerator = Accelerator()
    model = nn.Conv2d(10, 20, 3, 1, 1)
    print("prepare")
    model = accelerator.prepare(model)
    print("done")

with config:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

“done” never gets printed.
“done” does get printed if I set num_processes to 1.

On another server with the same config, “done” gets printed.

NCCL debug output from the server where it hangs:

prepare
prepare
prepare
osprey2:1465885:1465885 [0] NCCL INFO Bootstrap : Using eno1:128.174.136.28<0>
osprey2:1465885:1465885 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
osprey2:1465885:1465885 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
osprey2:1465885:1466929 [0] NCCL INFO Failed to open libibverbs.so[.1]
osprey2:1465885:1466929 [0] NCCL INFO NET/Socket : Using [0]eno1:128.174.136.28<0>
osprey2:1465885:1466929 [0] NCCL INFO Using network Socket
osprey2:1465891:1465891 [2] NCCL INFO cudaDriverVersion 11070
osprey2:1465889:1465889 [1] NCCL INFO cudaDriverVersion 11070
osprey2:1465889:1465889 [1] NCCL INFO Bootstrap : Using eno1:128.174.136.28<0>
osprey2:1465889:1465889 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
osprey2:1465889:1466930 [1] NCCL INFO Failed to open libibverbs.so[.1]
osprey2:1465889:1466930 [1] NCCL INFO NET/Socket : Using [0]eno1:128.174.136.28<0>
osprey2:1465889:1466930 [1] NCCL INFO Using network Socket
osprey2:1465891:1465891 [2] NCCL INFO Bootstrap : Using eno1:128.174.136.28<0>
osprey2:1465891:1465891 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
osprey2:1465891:1466931 [2] NCCL INFO Failed to open libibverbs.so[.1]
osprey2:1465891:1466931 [2] NCCL INFO NET/Socket : Using [0]eno1:128.174.136.28<0>
osprey2:1465891:1466931 [2] NCCL INFO Using network Socket
osprey2:1465889:1466930 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,ffff0000
osprey2:1465891:1466931 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000,ffff0000
osprey2:1465885:1466929 [0] NCCL INFO Setting affinity for GPU 0 to ffff,0000ffff
osprey2:1465891:1466931 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1
osprey2:1465889:1466930 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
osprey2:1465885:1466929 [0] NCCL INFO Channel 00/04 :    0   1   2
osprey2:1465885:1466929 [0] NCCL INFO Channel 01/04 :    0   1   2
osprey2:1465885:1466929 [0] NCCL INFO Channel 02/04 :    0   1   2
osprey2:1465885:1466929 [0] NCCL INFO Channel 03/04 :    0   1   2
osprey2:1465885:1466929 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
osprey2:1465885:1466929 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[81000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 00/0 : 1[81000] -> 2[e2000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Channel 00/0 : 2[e2000] -> 0[21000] via P2P/IPC
osprey2:1465885:1466929 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[81000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 01/0 : 1[81000] -> 2[e2000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Channel 01/0 : 2[e2000] -> 0[21000] via P2P/IPC
osprey2:1465885:1466929 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[81000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 02/0 : 1[81000] -> 2[e2000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Channel 02/0 : 2[e2000] -> 0[21000] via P2P/IPC
osprey2:1465885:1466929 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[81000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 03/0 : 1[81000] -> 2[e2000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Channel 03/0 : 2[e2000] -> 0[21000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Connected all rings
osprey2:1465891:1466931 [2] NCCL INFO Connected all rings
osprey2:1465891:1466931 [2] NCCL INFO Channel 00/0 : 2[e2000] -> 1[81000] via P2P/IPC
osprey2:1465885:1466929 [0] NCCL INFO Connected all rings
osprey2:1465891:1466931 [2] NCCL INFO Channel 01/0 : 2[e2000] -> 1[81000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Channel 02/0 : 2[e2000] -> 1[81000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Channel 03/0 : 2[e2000] -> 1[81000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 00/0 : 1[81000] -> 0[21000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 01/0 : 1[81000] -> 0[21000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 02/0 : 1[81000] -> 0[21000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 03/0 : 1[81000] -> 0[21000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Connected all trees
osprey2:1465891:1466931 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
osprey2:1465891:1466931 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
osprey2:1465889:1466930 [1] NCCL INFO Connected all trees
osprey2:1465889:1466930 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
osprey2:1465889:1466930 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
osprey2:1465885:1466929 [0] NCCL INFO Connected all trees
osprey2:1465885:1466929 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
osprey2:1465885:1466929 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
osprey2:1465885:1466929 [0] NCCL INFO comm 0x3d1af060 rank 0 nranks 3 cudaDev 0 busId 21000 - Init COMPLETE
osprey2:1465891:1466931 [2] NCCL INFO comm 0x3de99e20 rank 2 nranks 3 cudaDev 2 busId e2000 - Init COMPLETE
osprey2:1465889:1466930 [1] NCCL INFO comm 0x3bd29b10 rank 1 nranks 3 cudaDev 1 busId 81000 - Init COMPLETE

NCCL debug output from the server where it works:

prepare
prepare
prepare
owl:1514262:1514262 [0] NCCL INFO Bootstrap : Using eno1np0:172.22.224.10<0>
owl:1514262:1514262 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
owl:1514262:1514262 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.14.3+cuda11.7
owl:1514262:1514308 [0] NCCL INFO Failed to open libibverbs.so[.1]
owl:1514262:1514308 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.22.224.10<0>
owl:1514262:1514308 [0] NCCL INFO Using network Socket
owl:1514264:1514264 [2] NCCL INFO cudaDriverVersion 12000
owl:1514263:1514263 [1] NCCL INFO cudaDriverVersion 12000
owl:1514264:1514264 [2] NCCL INFO Bootstrap : Using eno1np0:172.22.224.10<0>
owl:1514264:1514264 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
owl:1514264:1514309 [2] NCCL INFO Failed to open libibverbs.so[.1]
owl:1514264:1514309 [2] NCCL INFO NET/Socket : Using [0]eno1np0:172.22.224.10<0>
owl:1514264:1514309 [2] NCCL INFO Using network Socket
owl:1514263:1514263 [1] NCCL INFO Bootstrap : Using eno1np0:172.22.224.10<0>
owl:1514263:1514263 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
owl:1514263:1514310 [1] NCCL INFO Failed to open libibverbs.so[.1]
owl:1514263:1514310 [1] NCCL INFO NET/Socket : Using [0]eno1np0:172.22.224.10<0>
owl:1514263:1514310 [1] NCCL INFO Using network Socket
owl:1514263:1514310 [1] NCCL INFO Setting affinity for GPU 1 to aa,aaaaaaaa
owl:1514262:1514308 [0] NCCL INFO Setting affinity for GPU 0 to 55,55555555
owl:1514264:1514309 [2] NCCL INFO Setting affinity for GPU 2 to aa,aaaaaaaa
owl:1514263:1514310 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
owl:1514262:1514308 [0] NCCL INFO Channel 00/02 : 0 1 2
owl:1514262:1514308 [0] NCCL INFO Channel 01/02 : 0 1 2
owl:1514262:1514308 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
owl:1514264:1514309 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
owl:1514263:1514310 [1] NCCL INFO Channel 00 : 1[af000] -> 2[d8000] via SHM/direct/direct
owl:1514263:1514310 [1] NCCL INFO Channel 01 : 1[af000] -> 2[d8000] via SHM/direct/direct
owl:1514262:1514308 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[af000] via SHM/direct/direct
owl:1514262:1514308 [0] NCCL INFO Channel 01 : 0[3b000] -> 1[af000] via SHM/direct/direct
owl:1514264:1514309 [2] NCCL INFO Channel 00 : 2[d8000] -> 0[3b000] via SHM/direct/direct
owl:1514264:1514309 [2] NCCL INFO Channel 01 : 2[d8000] -> 0[3b000] via SHM/direct/direct
owl:1514263:1514310 [1] NCCL INFO Connected all rings
owl:1514264:1514309 [2] NCCL INFO Connected all rings
owl:1514262:1514308 [0] NCCL INFO Connected all rings
owl:1514264:1514309 [2] NCCL INFO Channel 00 : 2[d8000] -> 1[af000] via SHM/direct/direct
owl:1514264:1514309 [2] NCCL INFO Channel 01 : 2[d8000] -> 1[af000] via SHM/direct/direct
owl:1514263:1514310 [1] NCCL INFO Channel 00 : 1[af000] -> 0[3b000] via SHM/direct/direct
owl:1514263:1514310 [1] NCCL INFO Channel 01 : 1[af000] -> 0[3b000] via SHM/direct/direct
owl:1514262:1514308 [0] NCCL INFO Connected all trees
owl:1514262:1514308 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
owl:1514262:1514308 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
owl:1514263:1514310 [1] NCCL INFO Connected all trees
owl:1514263:1514310 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
owl:1514263:1514310 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
owl:1514264:1514309 [2] NCCL INFO Connected all trees
owl:1514264:1514309 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
owl:1514264:1514309 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
owl:1514263:1514310 [1] NCCL INFO comm 0x3a270010 rank 1 nranks 3 cudaDev 1 busId af000 - Init COMPLETE
owl:1514264:1514309 [2] NCCL INFO comm 0x3d6b6c90 rank 2 nranks 3 cudaDev 2 busId d8000 - Init COMPLETE
owl:1514262:1514308 [0] NCCL INFO comm 0x3daa0080 rank 0 nranks 3 cudaDev 0 busId 3b000 - Init COMPLETE
done
done
done
owl:1514262:1514312 [0] NCCL INFO [Service thread] Connection closed by localRank 0
owl:1514262:1514262 [0] NCCL INFO comm 0x3daa0080 rank 0 nranks 3 cudaDev 0 busId 3b000 - Abort COMPLETE
owl:1514264:1514313 [2] NCCL INFO [Service thread] Connection closed by localRank 2
owl:1514264:1514264 [2] NCCL INFO comm 0x3d6b6c90 rank 2 nranks 3 cudaDev 2 busId d8000 - Abort COMPLETE
owl:1514263:1514311 [1] NCCL INFO [Service thread] Connection closed by localRank 1
owl:1514263:1514263 [1] NCCL INFO comm 0x3a270010 rank 1 nranks 3 cudaDev 1 busId af000 - Abort COMPLETE
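
One difference visible in the two logs is the channel transport: the server that hangs connects its GPUs via P2P/IPC, while the server that works uses SHM/direct/direct. Whether that is actually the cause is an assumption, not something the logs prove. Below is a minimal sketch for checking it. The helper names transports and env_without_p2p and the script name repro.py are hypothetical; NCCL_P2P_DISABLE and NCCL_DEBUG are standard NCCL environment variables.

```python
import os
import re
from collections import Counter

# Captures the transport at the end of NCCL "Channel ... via <transport>" lines.
_VIA = re.compile(r"NCCL INFO Channel \S+ : .* via (\S+)\s*$")


def transports(log_text):
    """Count the transports NCCL reports for its channels in a debug log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = _VIA.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def env_without_p2p(base=None):
    """Return a copy of the environment with NCCL peer-to-peer disabled.

    This is a diagnostic setting, not a confirmed fix: if the hang goes
    away with it, the P2P path between the GPUs is the likely suspect.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_P2P_DISABLE"] = "1"  # ask NCCL not to use P2P between GPUs
    env["NCCL_DEBUG"] = "INFO"     # keep the verbose NCCL log for comparison
    return env


# Possible usage (repro.py is a hypothetical file holding the script above):
#   import subprocess
#   subprocess.run(["accelerate", "launch", "repro.py"], env=env_without_p2p())
```

If “done” is printed with P2P disabled, running transports over the new log should no longer show P2P/IPC, which would point at the P2P path on that machine rather than at Accelerate itself.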