Multi-node training fails with "Proxy Call to rank 0 failed (Connect)"

I am trying to run multi-node training with two nodes, one GPU on each. This is my configuration:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0 # 1 on the second node
main_process_ip: 10.3.40.125 # Same on both nodes
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: false

I get the following logs when running accelerate test --config_file accelerate_config.yaml on the slave node, while the master node just keeps waiting for synchronization.

Running:  accelerate-launch --config_file=accelerate_config.yaml /home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: [2022-12-15 19:47:59,266] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: DEEPSPEED  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none'}, 'offload_param': {'device': 'none'}, 'stage3_gather_16bit_weights_on_model_save': False}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'fp16': {'enabled': True, 'auto_cast': True}}
stdout: 
stdout: 
stdout: **Test random number generator synchronization**
stderr: Traceback (most recent call last):
stderr:   File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 359, in <module>
stderr:     main()
stderr:   File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 336, in main
stderr:     rng_sync_check()
stderr:   File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 44, in rng_sync_check
stderr:     synchronize_rng_states(["torch"])
stderr:   File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/utils/random.py", line 88, in synchronize_rng_states
stderr:     synchronize_rng_state(RNGType(rng_type), generator=generator)
stderr:   File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/utils/random.py", line 70, in synchronize_rng_state
stderr:     torch.distributed.broadcast(rng_state, 0)
stderr:   File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
stderr:     work = default_pg.broadcast([tensor], opts)
stderr: RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1666642975993/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
stderr: ncclInternalError: Internal check failed.
stderr: Last error:
stderr: Proxy Call to rank 0 failed (Connect)
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6239) of binary: /home4/nouman_tanveer/anaconda3/envs/ldm/bin/python

Your help would be appreciated.
Also, if anyone can point me to beginner resources on multi-node training, that would be helpful as well; the Accelerate documentation assumes you already know how it works.

I think you want this to be num_processes: 1, because it's the number of GPUs per machine, not the total.
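
For reference, with that change the relevant fields would look roughly like this (a sketch assuming one GPU per node; everything else in your config stays the same):

num_processes: 1
num_machines: 2
machine_rank: 0 # 1 on the second node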

Thank you, that fixed it.

@muellerzr but when I try to run nlp_example.py from the repo with the updated config, it doesn't run; I tried prepending NCCL_DEBUG=INFO, and it still doesn't return anything. However, it does work when I launch it with the default config on a single node.
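
For reference, the launch looks roughly like this (a sketch; NCCL_SOCKET_IFNAME and NCCL_DEBUG_SUBSYS are extra, standard NCCL variables that might surface more detail, and eth0 is just a placeholder for the actual network interface):

# run on each node, with machine_rank set per node in the config
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET NCCL_SOCKET_IFNAME=eth0 \
  accelerate launch --config_file accelerate_config.yaml nlp_example.py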

Can you print the full config file? I can try to take a look at this before the team is on vacation 🙂

Also the output of accelerate env and your DeepSpeed version, preferably.

cc @smangrul as well 🙂

This is the default config I’m overwriting:

- `Accelerate` version: 0.15.0
- Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
- Python version: 3.9.15
- Numpy version: 1.23.4
- PyTorch version (GPU?): 1.13.0 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: fp16
        - use_cpu: False
        - dynamo_backend: NO
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - main_process_ip: None
        - main_process_port: None
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: no
        - tpu_name: None
        - tpu_zone: None
        - command_file: None
        - commands: None

with:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0 # 1 on the second node
main_process_ip: 10.3.40.125 # Same on both nodes
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: false

DeepSpeed version:

deepspeed                 0.7.7                    pypi_0    pypi

Thank you for your help.

Hello @Noman, does simple Multi-GPU training across the 2 nodes work?
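
For example, a config along these lines (a sketch mirroring the fields above, but with distributed_type: MULTI_GPU instead of DEEPSPEED) could be used to check that plain NCCL communication between the two nodes works:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0 # 1 on the second node
main_process_ip: 10.3.40.125
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 1 # per the suggestion above: GPUs per machine
use_cpu: false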