Multi-node training fails: "Proxy Call to rank 0 failed (Connect)"

I am trying to run multi-node training with two nodes, each with one GPU.
This is my configuration:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0 # 1 in the second node
main_process_ip: 10.3.40.125 # Same on both nodes
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
use_cpu: false
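
For reference, I run the same command on both machines; the only difference between the two config files is machine_rank (0 on the main node, 1 on the second node), and main_process_ip points to the main node on both:

# on node 0 (main node, 10.3.40.125), with machine_rank: 0 in accelerate_config.yaml
accelerate test --config_file accelerate_config.yaml

# on node 1, with an otherwise identical accelerate_config.yaml but machine_rank: 1
accelerate test --config_file accelerate_config.yaml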

I get the following logs when running accelerate test --config_file accelerate_config.yaml on the slave node, while the master node just keeps waiting for synchronization.

Running:  accelerate-launch --config_file=accelerate_config.yaml /home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: [2022-12-15 19:47:59,266] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: DEEPSPEED  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none'}, 'offload_param': {'device': 'none'}, 'stage3_gather_16bit_weights_on_model_save': False}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'fp16': {'enabled': True, 'auto_cast': True}}
stdout: 
stdout: 
stdout: **Test random number generator synchronization**
stderr: Traceback (most recent call last):
stderr:   File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 359, in <module>
stderr:     main()
stderr:   File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 336, in main
stderr:     rng_sync_check()
stderr:   File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 44, in rng_sync_check
stderr:     synchronize_rng_states(["torch"])
stderr:   File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/utils/random.py", line 88, in synchronize_rng_states
stderr:     synchronize_rng_state(RNGType(rng_type), generator=generator)
stderr:   File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/accelerate/utils/random.py", line 70, in synchronize_rng_state
stderr:     torch.distributed.broadcast(rng_state, 0)
stderr:   File "/home4/nouman_tanveer/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
stderr:     work = default_pg.broadcast([tensor], opts)
stderr: RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1666642975993/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
stderr: ncclInternalError: Internal check failed.
stderr: Last error:
stderr: Proxy Call to rank 0 failed (Connect)
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6239) of binary: /home4/nouman_tanveer/anaconda3/envs/ldm/bin/python

Any help would be appreciated.
Also, if anyone can point me to beginner-friendly resources on multi-node training, that would be helpful; the Accelerate documentation assumes you already know how it works.