How to use Accelerate for multi-GPU training on several nodes

Hi,

I wonder how to set up Accelerate, or whether I can train a model at all, if I have 2 physical machines on the same network. Each machine has 4 GPUs.

Can I use Accelerate + DeepSpeed to train a model with this configuration?

I can’t seem to find any writeups or examples of how to run “accelerate config” for this setup.

Thanks.

I’m not sure what documentation you need. Just type accelerate config in the terminal on both machines and follow the prompts.

Hi,

I have read the Accelerate docs. They show how to train on a single multi-GPU machine (one machine) using “accelerate config”.

I am looking for an example of how to train on 2 multi-GPU machines. In my setup, each machine has 4 GPUs.

Which packages do I need to install? For example:

  • On machine 1, I install accelerate & deepspeed and run accelerate config.
  • On machine 2, do I also just install accelerate & deepspeed?

Is multi-GPU training across 2 machines possible?

Like I said, you need to run accelerate config on both machines (and yes you need to install everything you need on both of them), then run accelerate launch training_script.py on both machines as well.
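
For anyone who wants something more concrete, here is a rough sketch of what the two launches might look like for the 2 x 4 GPU setup from this thread. It is a small Python helper that only builds and prints the accelerate launch command for each machine. It passes on the command line the same values that accelerate config asks for, so treat it as an alternative to the saved config rather than the official recipe. The helper name build_launch_command, the IP address 192.168.1.10 and the port 29500 are made up for illustration; replace them with the address of your first machine and a free port. It uses plain multi-GPU mode and leaves the DeepSpeed options out. Note that --num_processes is the total across all machines (8 here), not the per-machine count.

```python
import shlex


def build_launch_command(machine_rank, main_ip, main_port=29500,
                         num_machines=2, gpus_per_machine=4,
                         script="training_script.py"):
    """Return the `accelerate launch` argument list for one machine.

    build_launch_command is a hypothetical helper, not part of Accelerate.
    machine_rank is 0 on the main machine and 1, 2, ... on the others.
    """
    return [
        "accelerate", "launch",
        "--multi_gpu",
        "--num_machines", str(num_machines),
        # Total number of processes across all machines, one per GPU.
        "--num_processes", str(num_machines * gpus_per_machine),
        "--machine_rank", str(machine_rank),
        # Both machines point at the same main (rank 0) machine.
        "--main_process_ip", main_ip,
        "--main_process_port", str(main_port),
        script,
    ]


if __name__ == "__main__":
    MAIN_IP = "192.168.1.10"  # assumption: address of machine 1 on your network
    for rank in range(2):
        print(f"# run this on machine {rank + 1}")
        print(shlex.join(build_launch_command(rank, MAIN_IP)))
```

The only value that differs between the two machines is --machine_rank. Everything else, including the main process IP and port, has to match on both, and the second machine must be able to reach the first one on that port.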