Accelerate socket timeout on multi-node LLM training

I have recently been trying to distribute the fine-tuning of an LLM across multiple nodes (multiple machines). I have two machines, each with an RTX 3060. The first machine (the host, 172.21.202.40) is configured as follows:
```yaml
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: '[0]'
machine_rank: 0
main_process_ip: 172.21.202.40
main_process_port: 5000
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

The other machine (172.21.201.195) is configured as follows:
```yaml
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: '[0]'
machine_rank: 0
main_process_ip: 172.21.202.40
main_process_port: 5000
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
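As a basic sanity check (a minimal sketch of my own, not part of the Accelerate test), the snippet below verifies that the second machine can open a plain TCP connection to the rendezvous address from the configs above. It only succeeds while the launcher on the host is already running and listening on that port:

```python
import socket

# Rendezvous address taken from the Accelerate configs above.
MAIN_PROCESS_IP = "172.21.202.40"
MAIN_PROCESS_PORT = 5000

# Try to open a plain TCP connection to the main process.
# A refused connection or a timeout here would point to a
# firewall/routing problem rather than an Accelerate issue.
try:
    with socket.create_connection((MAIN_PROCESS_IP, MAIN_PROCESS_PORT), timeout=5):
        print("TCP connection to the rendezvous port succeeded")
except OSError as exc:
    print(f"Could not reach {MAIN_PROCESS_IP}:{MAIN_PROCESS_PORT}: {exc}")
```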

When I run the test file provided by Accelerate with the command `accelerate test`, the two machines appear to hang until I get a timeout error. I checked the traffic between the two machines with Wireshark, and they do seem to exchange data. The stream I captured is as follows:

I am not sure what the binary data is, but the data transfer itself seems to work. Both of my terminals hang on the line `Running: accelerate-launch /home/ubuntu/.local/lib/python3.10/site-packages/accelerate/test_utils/scripts/test_script.py`, and after a while I get a socket timeout error.
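To narrow down where the hang happens, a minimal torch.distributed rendezvous test outside of Accelerate can help (a sketch assuming the gloo backend is available; the rank is passed on the command line, while the address, port, and world size come from my configs):

```python
import sys

import torch
import torch.distributed as dist

# Minimal rendezvous check outside of Accelerate.
# Usage: python dist_check.py <rank>  (0 on the host, 1 on the other machine)
rank = int(sys.argv[1])

dist.init_process_group(
    backend="gloo",                          # CPU backend; avoids NCCL for this test
    init_method="tcp://172.21.202.40:5000",  # same address/port as the Accelerate configs
    rank=rank,
    world_size=2,
)

# A single all_reduce: if this completes, rendezvous and data
# exchange between the two machines work at the PyTorch level.
t = torch.ones(1)
dist.all_reduce(t)
print(f"rank {rank}: all_reduce result = {t.item()} (expected 2.0)")

dist.destroy_process_group()
```

If this completes on both machines but `accelerate test` still hangs, the problem is presumably in the Accelerate/NCCL layer rather than in basic networking between the machines.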