When I run:
import torch.nn as nn
from accelerate import Accelerator

if __name__ == "__main__":
    accelerator = Accelerator()
    model = nn.Conv2d(10, 20, 3, 1, 1)
    print("prepare")
    # With num_processes: 3 the script never returns from this call.
    model = accelerator.prepare(model)
    print("done")
with this accelerate config:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
“done” never gets printed; the script hangs inside accelerator.prepare().
“done” does get printed if I set num_processes to 1.
On another server with the same config, “done” also gets printed.
NCCL debug output from the server where it hangs (osprey2):
prepare
prepare
prepare
osprey2:1465885:1465885 [0] NCCL INFO Bootstrap : Using eno1:128.174.136.28<0>
osprey2:1465885:1465885 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
osprey2:1465885:1465885 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
osprey2:1465885:1466929 [0] NCCL INFO Failed to open libibverbs.so[.1]
osprey2:1465885:1466929 [0] NCCL INFO NET/Socket : Using [0]eno1:128.174.136.28<0>
osprey2:1465885:1466929 [0] NCCL INFO Using network Socket
osprey2:1465891:1465891 [2] NCCL INFO cudaDriverVersion 11070
osprey2:1465889:1465889 [1] NCCL INFO cudaDriverVersion 11070
osprey2:1465889:1465889 [1] NCCL INFO Bootstrap : Using eno1:128.174.136.28<0>
osprey2:1465889:1465889 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
osprey2:1465889:1466930 [1] NCCL INFO Failed to open libibverbs.so[.1]
osprey2:1465889:1466930 [1] NCCL INFO NET/Socket : Using [0]eno1:128.174.136.28<0>
osprey2:1465889:1466930 [1] NCCL INFO Using network Socket
osprey2:1465891:1465891 [2] NCCL INFO Bootstrap : Using eno1:128.174.136.28<0>
osprey2:1465891:1465891 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
osprey2:1465891:1466931 [2] NCCL INFO Failed to open libibverbs.so[.1]
osprey2:1465891:1466931 [2] NCCL INFO NET/Socket : Using [0]eno1:128.174.136.28<0>
osprey2:1465891:1466931 [2] NCCL INFO Using network Socket
osprey2:1465889:1466930 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,ffff0000
osprey2:1465891:1466931 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000,ffff0000
osprey2:1465885:1466929 [0] NCCL INFO Setting affinity for GPU 0 to ffff,0000ffff
osprey2:1465891:1466931 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1
osprey2:1465889:1466930 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
osprey2:1465885:1466929 [0] NCCL INFO Channel 00/04 : 0 1 2
osprey2:1465885:1466929 [0] NCCL INFO Channel 01/04 : 0 1 2
osprey2:1465885:1466929 [0] NCCL INFO Channel 02/04 : 0 1 2
osprey2:1465885:1466929 [0] NCCL INFO Channel 03/04 : 0 1 2
osprey2:1465885:1466929 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
osprey2:1465885:1466929 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[81000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 00/0 : 1[81000] -> 2[e2000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Channel 00/0 : 2[e2000] -> 0[21000] via P2P/IPC
osprey2:1465885:1466929 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[81000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 01/0 : 1[81000] -> 2[e2000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Channel 01/0 : 2[e2000] -> 0[21000] via P2P/IPC
osprey2:1465885:1466929 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[81000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 02/0 : 1[81000] -> 2[e2000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Channel 02/0 : 2[e2000] -> 0[21000] via P2P/IPC
osprey2:1465885:1466929 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[81000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 03/0 : 1[81000] -> 2[e2000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Channel 03/0 : 2[e2000] -> 0[21000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Connected all rings
osprey2:1465891:1466931 [2] NCCL INFO Connected all rings
osprey2:1465891:1466931 [2] NCCL INFO Channel 00/0 : 2[e2000] -> 1[81000] via P2P/IPC
osprey2:1465885:1466929 [0] NCCL INFO Connected all rings
osprey2:1465891:1466931 [2] NCCL INFO Channel 01/0 : 2[e2000] -> 1[81000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Channel 02/0 : 2[e2000] -> 1[81000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Channel 03/0 : 2[e2000] -> 1[81000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 00/0 : 1[81000] -> 0[21000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 01/0 : 1[81000] -> 0[21000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 02/0 : 1[81000] -> 0[21000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 03/0 : 1[81000] -> 0[21000] via P2P/IPC
osprey2:1465891:1466931 [2] NCCL INFO Connected all trees
osprey2:1465891:1466931 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
osprey2:1465891:1466931 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
osprey2:1465889:1466930 [1] NCCL INFO Connected all trees
osprey2:1465889:1466930 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
osprey2:1465889:1466930 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
osprey2:1465885:1466929 [0] NCCL INFO Connected all trees
osprey2:1465885:1466929 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
osprey2:1465885:1466929 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
osprey2:1465885:1466929 [0] NCCL INFO comm 0x3d1af060 rank 0 nranks 3 cudaDev 0 busId 21000 - Init COMPLETE
osprey2:1465891:1466931 [2] NCCL INFO comm 0x3de99e20 rank 2 nranks 3 cudaDev 2 busId e2000 - Init COMPLETE
osprey2:1465889:1466930 [1] NCCL INFO comm 0x3bd29b10 rank 1 nranks 3 cudaDev 1 busId 81000 - Init COMPLETE
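The log above stops after Init COMPLETE on all three ranks, with every channel connected via P2P/IPC. One experiment that might narrow this down is to rerun with NCCL's peer-to-peer transport disabled through the NCCL_P2P_DISABLE environment variable. This is only a diagnostic sketch, not a confirmed cause: the helper name nccl_debug_env and the script name repro.py are hypothetical, and NCCL_DEBUG=INFO is assumed to be the setting that produced the output in this post.

```python
import os


def nccl_debug_env(disable_p2p=True, base=None):
    """Return a copy of the environment with NCCL diagnostic variables set.

    `base` defaults to the current process environment; pass a dict to
    build the environment from something else.
    """
    env = dict(os.environ if base is None else base)
    # Assumed to be the logging level behind the "NCCL INFO" lines above.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Asks NCCL not to use direct GPU-to-GPU (P2P) transport.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    env = nccl_debug_env()
    print(env["NCCL_DEBUG"], env["NCCL_P2P_DISABLE"])
    # Hypothetical launch of the reproduction script with this environment:
    # import subprocess
    # subprocess.run(["accelerate", "launch", "repro.py"], env=env, check=True)
```

If “done” is printed with P2P disabled, that would point at GPU peer-to-peer communication on this machine rather than at the script or the config.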
NCCL debug output from the server where it works (owl):
prepare
prepare
prepare
owl:1514262:1514262 [0] NCCL INFO Bootstrap : Using eno1np0:172.22.224.10<0>
owl:1514262:1514262 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
owl:1514262:1514262 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.14.3+cuda11.7
owl:1514262:1514308 [0] NCCL INFO Failed to open libibverbs.so[.1]
owl:1514262:1514308 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.22.224.10<0>
owl:1514262:1514308 [0] NCCL INFO Using network Socket
owl:1514264:1514264 [2] NCCL INFO cudaDriverVersion 12000
owl:1514263:1514263 [1] NCCL INFO cudaDriverVersion 12000
owl:1514264:1514264 [2] NCCL INFO Bootstrap : Using eno1np0:172.22.224.10<0>
owl:1514264:1514264 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
owl:1514264:1514309 [2] NCCL INFO Failed to open libibverbs.so[.1]
owl:1514264:1514309 [2] NCCL INFO NET/Socket : Using [0]eno1np0:172.22.224.10<0>
owl:1514264:1514309 [2] NCCL INFO Using network Socket
owl:1514263:1514263 [1] NCCL INFO Bootstrap : Using eno1np0:172.22.224.10<0>
owl:1514263:1514263 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
owl:1514263:1514310 [1] NCCL INFO Failed to open libibverbs.so[.1]
owl:1514263:1514310 [1] NCCL INFO NET/Socket : Using [0]eno1np0:172.22.224.10<0>
owl:1514263:1514310 [1] NCCL INFO Using network Socket
owl:1514263:1514310 [1] NCCL INFO Setting affinity for GPU 1 to aa,aaaaaaaa
owl:1514262:1514308 [0] NCCL INFO Setting affinity for GPU 0 to 55,55555555
owl:1514264:1514309 [2] NCCL INFO Setting affinity for GPU 2 to aa,aaaaaaaa
owl:1514263:1514310 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
owl:1514262:1514308 [0] NCCL INFO Channel 00/02 : 0 1 2
owl:1514262:1514308 [0] NCCL INFO Channel 01/02 : 0 1 2
owl:1514262:1514308 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
owl:1514264:1514309 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
owl:1514263:1514310 [1] NCCL INFO Channel 00 : 1[af000] -> 2[d8000] via SHM/direct/direct
owl:1514263:1514310 [1] NCCL INFO Channel 01 : 1[af000] -> 2[d8000] via SHM/direct/direct
owl:1514262:1514308 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[af000] via SHM/direct/direct
owl:1514262:1514308 [0] NCCL INFO Channel 01 : 0[3b000] -> 1[af000] via SHM/direct/direct
owl:1514264:1514309 [2] NCCL INFO Channel 00 : 2[d8000] -> 0[3b000] via SHM/direct/direct
owl:1514264:1514309 [2] NCCL INFO Channel 01 : 2[d8000] -> 0[3b000] via SHM/direct/direct
owl:1514263:1514310 [1] NCCL INFO Connected all rings
owl:1514264:1514309 [2] NCCL INFO Connected all rings
owl:1514262:1514308 [0] NCCL INFO Connected all rings
owl:1514264:1514309 [2] NCCL INFO Channel 00 : 2[d8000] -> 1[af000] via SHM/direct/direct
owl:1514264:1514309 [2] NCCL INFO Channel 01 : 2[d8000] -> 1[af000] via SHM/direct/direct
owl:1514263:1514310 [1] NCCL INFO Channel 00 : 1[af000] -> 0[3b000] via SHM/direct/direct
owl:1514263:1514310 [1] NCCL INFO Channel 01 : 1[af000] -> 0[3b000] via SHM/direct/direct
owl:1514262:1514308 [0] NCCL INFO Connected all trees
owl:1514262:1514308 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
owl:1514262:1514308 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
owl:1514263:1514310 [1] NCCL INFO Connected all trees
owl:1514263:1514310 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
owl:1514263:1514310 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
owl:1514264:1514309 [2] NCCL INFO Connected all trees
owl:1514264:1514309 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
owl:1514264:1514309 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
owl:1514263:1514310 [1] NCCL INFO comm 0x3a270010 rank 1 nranks 3 cudaDev 1 busId af000 - Init COMPLETE
owl:1514264:1514309 [2] NCCL INFO comm 0x3d6b6c90 rank 2 nranks 3 cudaDev 2 busId d8000 - Init COMPLETE
owl:1514262:1514308 [0] NCCL INFO comm 0x3daa0080 rank 0 nranks 3 cudaDev 0 busId 3b000 - Init COMPLETE
done
done
done
owl:1514262:1514312 [0] NCCL INFO [Service thread] Connection closed by localRank 0
owl:1514262:1514262 [0] NCCL INFO comm 0x3daa0080 rank 0 nranks 3 cudaDev 0 busId 3b000 - Abort COMPLETE
owl:1514264:1514313 [2] NCCL INFO [Service thread] Connection closed by localRank 2
owl:1514264:1514264 [2] NCCL INFO comm 0x3d6b6c90 rank 2 nranks 3 cudaDev 2 busId d8000 - Abort COMPLETE
owl:1514263:1514311 [1] NCCL INFO [Service thread] Connection closed by localRank 1
owl:1514263:1514263 [1] NCCL INFO comm 0x3a270010 rank 1 nranks 3 cudaDev 1 busId af000 - Abort COMPLETE
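Both logs reach Init COMPLETE, so one way to compare them is to look at which transport each channel connection uses. The sketch below counts the "via ..." part of the NCCL channel lines; the function name count_transports and the regular expression are my own assumptions about the log format, based only on the lines shown in this post.

```python
import re
from collections import Counter

# Matches connection lines such as
#   "... NCCL INFO Channel 00/0 : 0[21000] -> 1[81000] via P2P/IPC"
#   "... NCCL INFO Channel 00 : 1[af000] -> 2[d8000] via SHM/direct/direct"
# and skips ring-order lines such as "... NCCL INFO Channel 00/04 : 0 1 2".
CHANNEL_RE = re.compile(
    r"NCCL INFO Channel \S+ : (\d+)\[\w+\] -> (\d+)\[\w+\] via (\S+)"
)


def count_transports(log_text):
    """Count how many channel connections use each transport string."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CHANNEL_RE.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts


# A few lines copied from the two logs in this post.
failing = """\
osprey2:1465885:1466929 [0] NCCL INFO Channel 00/04 : 0 1 2
osprey2:1465885:1466929 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[81000] via P2P/IPC
osprey2:1465889:1466930 [1] NCCL INFO Channel 00/0 : 1[81000] -> 2[e2000] via P2P/IPC
"""
working = """\
owl:1514263:1514310 [1] NCCL INFO Channel 00 : 1[af000] -> 2[d8000] via SHM/direct/direct
"""

print(dict(count_transports(failing)))  # {'P2P/IPC': 2}
print(dict(count_transports(working)))  # {'SHM/direct/direct': 1}
```

On the excerpts above, the hanging server connects over P2P/IPC while the working one uses SHM/direct/direct, which is the most visible difference between the two logs besides the driver version (11070 vs 12000).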