Timeout for multi-node training on Google Cloud (GCP)

I have 2 machines on GCP, each with 2 V100s, running a multi-node training job, but I get a timeout when I run accelerate launch train.py. I am not sure whether I misconfigured the GCP firewall or accelerate.
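For reference, the contents of the training script should not matter here: the traceback below shows the agents time out during rendezvous, before any user code runs. Even a minimal Accelerate script along these lines (an illustrative stand-in, not my actual train.py) hits the same timeout:

# train_sketch.py -- illustrative stand-in for train.py, not the real script
import torch
from accelerate import Accelerator

def main():
    accelerator = Accelerator()
    # A tiny model/optimizer pair, just enough to exercise the distributed setup.
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    model, optimizer = accelerator.prepare(model, optimizer)

    for step in range(10):
        x = torch.randn(8, 10, device=accelerator.device)
        loss = model(x).mean()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        accelerator.print(f"step {step}: loss {loss.item():.4f}")

if __name__ == "__main__":
    main()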

Traceback (most recent call last):
  File "/home/yangg/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/yangg/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/yangg/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "/home/yangg/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 549, in _rendezvous
    workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 624, in _assign_worker_ranks
    role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 661, in _share_and_gather
    role_infos_bytes = store_util.synchronize(
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout

Here is the default_config.yaml that I use on the main-process (rank 0) machine:

compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: 34.72.108.133
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 4
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

And here is the rank 1 machine's default_config.yaml (identical except for machine_rank):

compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 1
main_process_ip: 34.72.108.133
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 4
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

I also verified that I can connect from machine #1 to machine #0 (the main process) via telnet 34.72.108.133 29500, so I believe the GCP network is configured correctly.
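telnet only proves the port accepts a TCP connection; the traceback shows the timeout happens inside the c10d store that torchrun uses for rendezvous (store.get(...) raising Socket Timeout). A minimal sketch to exercise that same path directly, assuming PyTorch is installed and 34.72.108.133:29500 are the values from default_config.yaml; run it with IS_MASTER=1 on the rank 0 machine first, then IS_MASTER=0 on the rank 1 machine:

# store_check.py -- minimal sketch to test the c10d TCPStore rendezvous path
import os
from datetime import timedelta

from torch.distributed import TCPStore

MASTER_ADDR = "34.72.108.133"  # main_process_ip from default_config.yaml
MASTER_PORT = 29500            # main_process_port from default_config.yaml
is_master = os.environ.get("IS_MASTER", "0") == "1"

# The master side listens on the port; the worker side connects to it.
store = TCPStore(
    MASTER_ADDR,
    MASTER_PORT,
    world_size=2,              # one launcher/agent per machine
    is_master=is_master,
    timeout=timedelta(seconds=60),
)

# Simple set/get round trip: the worker blocks in get() until the master sets the key.
if is_master:
    store.set("ping", "pong")
    print("master: store is up and a worker connected")
else:
    print("worker: got", store.get("ping"))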

Any suggestions or tutorials about GCP multi-node training would be appreciated.

NVM, I had put in the wrong IP address. It might be a good idea to make a tutorial page for GCP multi-node training; I'm happy to submit one to share our experience. :hugs:
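For anyone hitting the same error, a quick sanity check before launching is to print the addresses the rank 0 machine actually has and compare them against main_process_ip (note that on GCP the VM's interface normally carries the internal VPC address, not the external one). A minimal sketch:

# ip_check.py -- print the addresses the rank 0 machine actually reports,
# to double-check main_process_ip in default_config.yaml
import socket

hostname = socket.gethostname()
# Picks the local address used to route outward; no packets are actually sent.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    s.connect(("8.8.8.8", 80))
    primary_ip = s.getsockname()[0]

print(f"hostname:   {hostname}")
print(f"primary IP: {primary_ip}")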

Can you please share the setup details? A brief write-up would be super helpful.