For multi-node training, the `accelerate` library requires manually running `accelerate config` on each machine. This is inconvenient when there are more than 10 nodes, since the configuration has to be entered by hand 10+ times. Is there a way to automatically generate the config file on each machine?
You can just reuse a single YAML config: copy it onto each machine and change only the node number (the `machine_rank` field) for that machine.
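A minimal sketch of that approach, not an official tool: it takes one base config and writes one copy per node with only `machine_rank` changed. The base config contents (IP, port, process counts), the helper names `render_node_config` and `write_node_configs`, and the `node_<rank>.yaml` file naming are all assumptions for illustration; substitute the YAML that `accelerate config` produced on your first machine.

```python
import re
from pathlib import Path

# Hypothetical base config, shaped like the YAML written by `accelerate config`.
# All values here are placeholders; use your own generated file instead.
BASE_CONFIG = """\
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_process_ip: 10.0.0.1
main_process_port: 29500
num_machines: 4
num_processes: 32
"""


def render_node_config(base_text: str, rank: int) -> str:
    """Return base_text with the value of machine_rank replaced by rank."""
    new_text, count = re.subn(
        r"^machine_rank:.*$",
        f"machine_rank: {rank}",
        base_text,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError("expected exactly one machine_rank line in the base config")
    return new_text


def write_node_configs(base_text: str, num_machines: int, out_dir: Path) -> list[Path]:
    """Write node_<rank>.yaml for every rank in [0, num_machines)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for rank in range(num_machines):
        path = out_dir / f"node_{rank}.yaml"
        path.write_text(render_node_config(base_text, rank))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Generate one file per node, then copy each to its machine (scp, rsync, ...).
    for p in write_node_configs(BASE_CONFIG, 4, Path("accelerate_configs")):
        print(p)
```

Each generated file can then be copied to its node and passed to the launcher, for example `accelerate launch --config_file node_2.yaml train.py` (where `train.py` stands in for your own script). The text substitution avoids a YAML dependency and leaves every other field untouched.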
Hi @muellerzr, I was wondering whether such a tool already exists. If not, I would like to develop one. Thank you very much for your answers.