Trainer not starting in a multi-GPU setting

I am fine-tuning a DeBERTa-v3-large model for classification with the Hugging Face Trainer, on a machine with two GPUs (a single node). The script works correctly when I force it onto a single GPU with CUDA_VISIBLE_DEVICES=0 (or 1), but when I let it run on both GPUs it gets stuck here (the dataset is already tokenized and cached, yet it tokenizes it again when running on 2 GPUs):

2/06/2024 15:52:35 - INFO - utils.training_utils - Device = cuda
02/06/2024 15:52:35 - INFO - utils.training_utils - Learning rate = 2e-05
02/06/2024 15:52:35 - INFO - utils.training_utils - Dataset loaded!
ciao
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-xsmall and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
02/06/2024 15:52:37 - INFO - utils.training_utils - Dataset is ready!
  0%|                                                                                                                                                                             | 0/55000 [00:00<?, ?it/s]
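
For context, the training code is essentially a standard Trainer setup launched with `accelerate launch`. Below is a simplified sketch, not my exact script: the dataset loading, number of labels, and hyperparameters are placeholders.

```python
# Simplified sketch of the setup (dataset, labels and hyperparameters are placeholders)
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "microsoft/deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# The tokenized dataset is cached on disk; map() should be a no-op on re-runs
dataset = load_dataset("csv", data_files={"train": "train.csv"})
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

training_args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
)

trainer = Trainer(model=model, args=training_args, train_dataset=dataset["train"])
trainer.train()  # hangs at 0% as soon as both GPUs are visible
```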


I can't even stop it with CTRL+C; I have to kill it by PID. I have already looked for a solution online, but without success. My `accelerate env` output is:

- `Accelerate` version: 0.26.1
- Platform: Linux-6.5.0-15-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 62.57 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: MULTI_GPU
	- mixed_precision: no
	- use_cpu: False
	- debug: False
	- num_processes: 2
	- machine_rank: 0
	- num_machines: 1
	- gpu_ids: all
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- downcast_bf16: no
	- tpu_use_cluster: False
	- tpu_use_sudo: False
	- tpu_env: []

I also tried " export NCCL_P2P_DISABLE=1" and “export NCCL_SOCKET_IFNAME=lo”. My NCCL info is the following:

gpu3:9421:9421 [0] NCCL INFO cudaDriverVersion 12020
gpu3:9421:9421 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpu3:9421:9421 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
gpu3:9421:9421 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
NCCL version 2.19.3+cuda12.3
gpu3:9421:9591 [0] NCCL INFO Failed to open libibverbs.so[.1]
gpu3:9421:9591 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpu3:9421:9591 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
gpu3:9421:9591 [0] NCCL INFO Using non-device net plugin version 0
gpu3:9421:9591 [0] NCCL INFO Using network Socket
gpu3:9421:9592 [1] NCCL INFO Using non-device net plugin version 0
gpu3:9421:9592 [1] NCCL INFO Using network Socket
gpu3:9421:9592 [1] NCCL INFO comm 0x55efb86fb940 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 2000 commId 0xad8758cb71a5a7a1 - Init START
gpu3:9421:9591 [0] NCCL INFO comm 0x55efb86f8200 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0xad8758cb71a5a7a1 - Init START
gpu3:9421:9592 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
gpu3:9421:9591 [0] NCCL INFO Channel 00/02 :    0   1
gpu3:9421:9591 [0] NCCL INFO Channel 01/02 :    0   1
gpu3:9421:9592 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
gpu3:9421:9592 [1] NCCL INFO P2P Chunksize set to 131072
gpu3:9421:9591 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
gpu3:9421:9591 [0] NCCL INFO P2P Chunksize set to 131072
gpu3:9421:9592 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
gpu3:9421:9591 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
gpu3:9421:9592 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
gpu3:9421:9591 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
gpu3:9421:9591 [0] NCCL INFO Connected all rings
gpu3:9421:9591 [0] NCCL INFO Connected all trees
gpu3:9421:9592 [1] NCCL INFO Connected all rings
gpu3:9421:9592 [1] NCCL INFO Connected all trees
gpu3:9421:9592 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpu3:9421:9592 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpu3:9421:9591 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpu3:9421:9591 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpu3:9421:9592 [1] NCCL INFO comm 0x55efb86fb940 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 2000 commId 0xad8758cb71a5a7a1 - Init COMPLETE
gpu3:9421:9591 [0] NCCL INFO comm 0x55efb86f8200 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0xad8758cb71a5a7a1 - Init COMPLETE
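
For reference, a minimal NCCL sanity check outside of the Trainer would look roughly like the sketch below (the file name is arbitrary; launch with `torchrun --nproc_per_node=2 nccl_check.py`). It uses the same NCCL_* environment variables, so it can help tell a communicator problem apart from a training-script problem.

```python
# nccl_check.py -- minimal sketch to test a bare NCCL all_reduce outside of Trainer
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")     # picks up the same NCCL_* env vars

    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)                          # with 2 ranks, each should see 2.0
    print(f"rank {dist.get_rank()}: {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```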

I also tried removing and recreating the virtual environment, but nothing worked. Any suggestions?

Update: after some digging, I found out that the problem was the call to enable_full_determinism(). Removing it fixed the hang.
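
For anyone hitting the same issue, this is roughly the change I made (the seed value is just a placeholder). As far as I understand, enable_full_determinism() also switches on deterministic algorithms on top of seeding, while plain set_seed() only seeds the RNGs.

```python
from transformers import set_seed

# before (this is what hung with 2 GPUs in my setup):
# from transformers import enable_full_determinism
# enable_full_determinism(42)

# after: plain seeding, without the full-determinism switches
set_seed(42)
```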