How to launch multi-node training using accelerate launch

I want to train on 2 machines with 8 GPUs each, but I am not sure how to use main_process_ip, rdzv_backend, and rdzv_conf. I would appreciate it if someone could help.

  1. I have the same config.yaml on both nodes, shown below:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
main_training_function: main
num_processes: 4 # default, set by cli
mixed_precision: no
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
  2. I run the command below on both machines. However, I am not sure how to set main_process_ip & main_process_port in this case. Should main_process_ip be the current machine’s IP or the rank 0 machine’s IP? Also, should rdzv_backend & rdzv_conf be set at the same time? If so, how should they be set?
accelerate launch --config_file="./configs/dev/acc_mnodes_config.yaml" \
  --gpu_ids=0,1,2,3,4,5,6,7 \
  --machine_rank=$NODE_RANK \
  --num_machines=$N_NODES \
  --num_processes=8 \
  --main_process_ip="$MASTER_ADDR" \
  --main_process_port="$MASTER_PORT" \
  --debug \
  scripts/train.py 
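
For reference, here is a minimal Python sketch of how the per-node values are commonly derived. It assumes, following the Accelerate CLI documentation, that --num_processes is the total number of processes across all machines and that --main_process_ip is the rank 0 machine's IP on every node. Under that assumption, 2 machines with 8 GPUs each would give 16 rather than 8. build_launch_args is a hypothetical helper, not part of Accelerate, and 10.0.0.1 is a placeholder address.

```python
from typing import List


def build_launch_args(node_rank: int, n_nodes: int, gpus_per_node: int,
                      master_addr: str, master_port: int, script: str) -> List[str]:
    """Build the `accelerate launch` argument list for one node (hypothetical helper)."""
    # Assumption: --num_processes is the total across all machines.
    total_processes = n_nodes * gpus_per_node
    return [
        "accelerate", "launch",
        f"--machine_rank={node_rank}",
        f"--num_machines={n_nodes}",
        f"--num_processes={total_processes}",
        # Assumption: the same rank 0 address and port are passed on every node.
        f"--main_process_ip={master_addr}",
        f"--main_process_port={master_port}",
        script,
    ]


if __name__ == "__main__":
    # Example values only; 10.0.0.1 stands in for the rank 0 machine's IP.
    for rank in range(2):
        print(" ".join(build_launch_args(rank, 2, 8, "10.0.0.1", 29500, "scripts/train.py")))
```

Only --machine_rank differs between the two printed commands; every other argument is identical on both nodes.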

I have set rdzv_backend to ‘c10d’, but I got the error below:

[E socket.cpp:860] [c10d] The client socket has timed out after 900s while trying to connect to (172.16.247.158, 29500).
Traceback (most recent call last):
  File "/running_package/project/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/running_package/project/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/running_package/project/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1066, in launch_command
    multi_gpu_launcher(args)
  File "/running_package/project/lib/python3.10/site-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
    distrib_run.run(args)
  File "/running_package/project/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/running_package/project/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/running_package/project/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
TimeoutError: The client socket has timed out after 900s while trying to connect to (172.16.247.158, 29500)
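
The traceback ends in static_tcp_rendezvous.py, where a TCPStore client times out trying to reach 172.16.247.158:29500. One way to narrow this down is a plain TCP reachability check run from the non-zero-rank machine. The sketch below uses only the Python standard library; can_connect is a hypothetical helper name. It is probably only meaningful while the launcher on the rank 0 machine is already running, since otherwise nothing is expected to be listening on that port.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Address and port taken from the error message above.
    print(can_connect("172.16.247.158", 29500))
```

If this returns False while rank 0 is waiting, the cause may be a firewall, a wrong address, or a blocked port rather than the Accelerate configuration itself.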