I have two machines on GCP, each with two V100 GPUs, running a multi-node training job, but I hit a timeout when I run accelerate launch train.py. I am not sure whether I misconfigured the GCP firewall or accelerate itself.
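If I read the traceback right, the static rendezvous is basically a TCPStore handshake between the two launcher agents on main_process_ip:main_process_port, and a store.get() is what times out. This is only my mental model, not the actual accelerate/torchelastic code, but roughly:

from datetime import timedelta
import torch.distributed as dist

# machine_rank 0 hosts the store (server side):
store = dist.TCPStore("34.72.108.133", 29500, 2, True, timedelta(seconds=60))

# machine_rank 1 connects to it (client side):
store = dist.TCPStore("34.72.108.133", 29500, 2, False, timedelta(seconds=60))

# The agents then exchange their info through set()/get(); my run dies in a
# get() like the store.get() at the bottom of the traceback, where
# "Socket Timeout" means the expected key/peer never arrived in time.
store.set("some_key", "hello")   # the key name here is made up for this sketch
print(store.get("some_key"))

Here is the full traceback: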
Traceback (most recent call last):
File "/home/yangg/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/yangg/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/yangg/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File "/home/yangg/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
self._initialize_workers(self._worker_group)
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
self._rendezvous(worker_group)
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 549, in _rendezvous
workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 624, in _assign_worker_ranks
role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 661, in _share_and_gather
role_infos_bytes = store_util.synchronize(
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
agent_data = get_all(store, rank, key_prefix, world_size)
File "/home/yangg/venv/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Here is the default_config.yaml that I use on the main process (machine_rank 0) machine:
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: 34.72.108.133
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 4
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
And here is the default_config.yaml on the machine_rank 1 machine:
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 1
main_process_ip: 34.72.108.133
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 4
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Also, I verified that I can connect from machine #1 to machine #0 (the main process) with telnet 34.72.108.133 29500, so I believe the GCP network is configured correctly.
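To go one step beyond telnet, the next thing I plan to try is a minimal raw torch.distributed check with one process per machine, leaving accelerate out of the picture entirely. This is just a sketch (gloo backend so no GPUs/NCCL are involved, and the rank is passed via an environment variable I set by hand), not anything from the accelerate docs:

import os
import datetime
import torch
import torch.distributed as dist

# Run this file once on each machine: RANK=0 on the main machine, RANK=1 on the other.
# Address and port mirror main_process_ip / main_process_port from my configs.
rank = int(os.environ.get("RANK", "0"))

dist.init_process_group(
    backend="gloo",                              # CPU-only, rules out NCCL issues
    init_method="tcp://34.72.108.133:29500",
    rank=rank,
    world_size=2,                                # one test process per machine
    timeout=datetime.timedelta(seconds=60),
)

t = torch.ones(1)
dist.all_reduce(t)                               # expect tensor([2.]) on both machines
print(f"rank {rank}: {t}")
dist.destroy_process_group()

If that also times out, I would read it as a network/firewall problem rather than an accelerate configuration problem.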
Any suggestions or tutorials about multi-node training on GCP would be appreciated.