Multi-node training


Thank you very much for the accelerate lib. :hugs:

We are currently experiencing a difficulty and were wondering if this could be a known case.

We want to run a training job with accelerate and deepspeed on 4 nodes with 4 GPUs each. However, we see in our logs that 4 processes each consider themselves to be both a main_process and a local_main_process. We would have expected to see 1 main_process and 4 local_main_process (one per node).
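To make the expectation concrete, here is a small plain-Python sketch of the rank arithmetic. It does not use accelerate itself. The helper names are hypothetical, and the "independent nodes" scenario is only an assumption about a setup that would produce the symptom we saw: each node numbering its processes from 0 as if it were a standalone job.

```python
def count_main_flags(ranks):
    """Count processes that would report main / local-main status.

    ranks is a list of (global_rank, local_rank) pairs. A process is
    treated as main when global_rank == 0 and as local main when
    local_rank == 0.
    """
    mains = sum(1 for global_rank, _ in ranks if global_rank == 0)
    local_mains = sum(1 for _, local_rank in ranks if local_rank == 0)
    return mains, local_mains


def coordinated_ranks(num_nodes, gpus_per_node):
    """Ranks when all nodes belong to one job (global ranks 0..N-1)."""
    return [
        (node * gpus_per_node + local, local)
        for node in range(num_nodes)
        for local in range(gpus_per_node)
    ]


def independent_ranks(num_nodes, gpus_per_node):
    """Assumed failure mode: every node numbers its processes from 0."""
    return [
        (local, local)
        for _ in range(num_nodes)
        for local in range(gpus_per_node)
    ]


expected = count_main_flags(coordinated_ranks(4, 4))
observed = count_main_flags(independent_ranks(4, 4))
print("expected (main, local_main):", expected)
print("observed (main, local_main):", observed)
```

With 4 nodes of 4 GPUs, the coordinated case gives 1 main and 4 local main processes, while the independent case gives 4 of each, which is what our logs showed.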

Is what we expected wrong? Do you see a mistake we could have made?

Thanks a lot in advance!

For anyone who may encounter this problem in the future: this was a mistake on our part! We had not specified the deepspeed_multinode_launcher argument in the accelerate configuration, so it was set to None. Adding deepspeed_multinode_launcher: standard to the deepspeed_config section of the configuration file solved our problem.
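For reference, a trimmed sketch of what the relevant part of the configuration file might look like. Only the deepspeed_multinode_launcher line is the fix described here. The other keys and values (machine rank, IP address, port, process counts) are placeholder assumptions for a 4 x 4 setup and need to be adapted to your cluster.

```yaml
# Sketch of an accelerate config for 4 nodes x 4 GPUs (values are assumptions)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_multinode_launcher: standard  # the line that was missing for us
machine_rank: 0               # assumed to differ on each node (0 to 3)
main_process_ip: 10.0.0.1     # placeholder address of the rank-0 node
main_process_port: 29500      # placeholder port
num_machines: 4
num_processes: 16             # total across all nodes
```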
