I want to use 2 machines, each with 8 GPUs, to start training, but I am not sure how to use main_process_ip, rdzv_backend, and rdzv_conf. I would appreciate it if someone could help.
- I have the same config.yaml on both nodes, as below:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
main_training_function: main
num_processes: 4 # default, set by cli
mixed_precision: no
rdzv_backend: c10d
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
- I execute the command below on both machines; however, I am not sure how to set main_process_ip in this case. Should main_process_ip be the current machine's IP or the rank 0 machine's IP? Also, should rdzv_backend and rdzv_conf be set at the same time? If yes, how should I set them?
accelerate launch --config_file="./configs/dev/acc_mnodes_config.yaml" \
--gpu_ids=0,1,2,3,4,5,6,7 \
--machine_rank=$NODE_RANK \
--num_machines=$N_NODES \
--num_processes=8 \
--main_process_ip="$MASTER_ADDR" \
--main_process_port="$MASTER_PORT" \
--debug \
scripts/train.py
I have set rdzv_backend to ‘c10d’, but I got the error below:
[E socket.cpp:860] [c10d] The client socket has timed out after 900s while trying to connect to (172.16.247.158, 29500).
Traceback (most recent call last):
File "/running_package/project/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/running_package/project/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/running_package/project/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1066, in launch_command
multi_gpu_launcher(args)
File "/running_package/project/lib/python3.10/site-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
distrib_run.run(args)
File "/running_package/project/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/running_package/project/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/running_package/project/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
self._initialize_workers(self._worker_group)
File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
self._rendezvous(worker_group)
File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
File "/running_package/project/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
TimeoutError: The client socket has timed out after 900s while trying to connect to (172.16.247.158, 29500)
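Since the traceback ends in a TCPStore client timeout, one thing that might help narrow this down is a plain TCP reachability check, run from the non-rank-0 node against the address and port in the error. The sketch below is only a diagnostic, not part of accelerate or torch; the helper name can_connect is made up for illustration, and it assumes MASTER_ADDR and MASTER_PORT are exported the same way as for the launch command. It can only succeed while something is actually listening on that port (for example, while the launcher on the rank 0 machine is running).

```python
import os
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Hypothetical helper for illustration only; it is not an accelerate/torch API.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, DNS failure, and timeouts.
        return False


if __name__ == "__main__":
    # Assumption: the same environment variables used in the launch command.
    host = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(os.environ.get("MASTER_PORT", "29500"))
    print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

If this reports the port as unreachable while the rank 0 launcher is running, the problem is more likely the network path (firewall, wrong IP, wrong interface) than the rdzv settings.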