Multi-node training

SaulLu · September 5, 2022, 4:15pm

Hello,

Thank you very much for the accelerate lib.

We are running into a problem and were wondering whether it is a known issue.

We want to run a training job with accelerate and deepspeed on 4 nodes with 4 GPUs each. However, our logs show that 4 processes each consider themselves to be both a main_process and a local_main_process. We would have expected to see 1 main_process and 4 local_main_process.
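To make the expectation concrete, here is a minimal sketch in plain Python (it makes no accelerate calls). It assumes that global ranks are assigned contiguously per node, that the main process is global rank 0, and that a local main process is local rank 0 on its node. The helper name classify and the constants are hypothetical, chosen only for illustration.

```python
# Hypothetical illustration: count main / local-main processes for a
# 4-node x 4-GPU layout, assuming contiguous rank assignment per node.

NUM_NODES = 4
GPUS_PER_NODE = 4


def classify(rank: int, gpus_per_node: int) -> dict:
    """Derive node index, local rank, and the two 'main' flags from a global rank."""
    local_rank = rank % gpus_per_node
    return {
        "rank": rank,
        "node": rank // gpus_per_node,
        "local_rank": local_rank,
        # Assumption: exactly one global main process, at global rank 0.
        "is_main_process": rank == 0,
        # Assumption: one local main process per node, at local rank 0.
        "is_local_main_process": local_rank == 0,
    }


processes = [classify(r, GPUS_PER_NODE) for r in range(NUM_NODES * GPUS_PER_NODE)]
main_count = sum(p["is_main_process"] for p in processes)
local_main_count = sum(p["is_local_main_process"] for p in processes)

print(main_count, local_main_count)  # prints: 1 4
```

Under these assumptions a correctly wired 16-process job has 1 main process and 4 local main processes. Seeing 4 processes that are all both at once is what you would get if each node numbered its processes independently of the others.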

Is what we expected wrong? Do you see a mistake we could have made?

Thanks a lot in advance!

SaulLu · September 6, 2022, 7:08am

For anyone who may encounter this problem in the future: this was a mistake on our part! We had not specified the deepspeed_multinode_launcher argument in the accelerate configuration, so it was set to None. Adding deepspeed_multinode_launcher: standard to the deepspeed_config section of the configuration file solved our problem.
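As a rough sanity check, the sketch below looks for this setting in a configuration that has already been parsed into a Python dict. It is not our actual configuration file. The function name and both example dicts are hypothetical, and the only keys taken from the fix described above are deepspeed_config and deepspeed_multinode_launcher.

```python
# Hypothetical check on an already-parsed accelerate config (a plain dict).
# It only inspects the deepspeed_multinode_launcher key named in this thread.


def missing_multinode_launcher(config: dict) -> bool:
    """Return True if deepspeed_multinode_launcher is absent or None."""
    deepspeed_section = config.get("deepspeed_config") or {}
    return deepspeed_section.get("deepspeed_multinode_launcher") is None


# Illustrative dicts only; a real config file has more entries.
broken_config = {"deepspeed_config": {}}
fixed_config = {"deepspeed_config": {"deepspeed_multinode_launcher": "standard"}}

print(missing_multinode_launcher(broken_config))  # prints: True
print(missing_multinode_launcher(fixed_config))   # prints: False
```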

mderakhshani · January 16, 2023, 7:15pm

Hi @SaulLu, I am trying to do the same thing but with two nodes, each with 3 GPUs. Could you please share your config file content here?
