Using Accelerate on an HPC (Slurm)

Hi,

I am performing some tests with Accelerate on an HPC cluster (where Slurm is usually how we distribute computation). It works on one node with multiple GPUs, but now I want to try a multi-node setup.

I will use your launcher, accelerate launch --config_file <config-file> <my script>, but then I need to be able to update a couple of fields from the json file within my script (so during the creation of the Accelerator?):

  • main_process_ip
  • machine_rank

How can I do that? Will it work?

Am I right to think that if my setup is two nodes, each with 4 GPUs, the (range of) values should be as follows (a rough sketch of the resulting config file is below the list)?

  • for “num_processes”: 8 (the total number of GPUs across both nodes)
  • for “num_machines”: 2
  • for “machine_rank”: 0 or 1 (one value per node)
  • for “distributed_type”: “MULTI_GPU”
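
In other words, for the node of rank 0 the json config would look roughly like this (the IP and port are placeholders, and the field names are my reading of what accelerate config generates, so I may have some of them wrong); the other node would be identical except for "machine_rank": 1:

```json
{
  "distributed_type": "MULTI_GPU",
  "machine_rank": 0,
  "main_process_ip": "10.0.0.1",
  "main_process_port": 29500,
  "num_machines": 2,
  "num_processes": 8
}
```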

Thanks

How do you usually distribute across multiple nodes with Slurm?
In PyTorch distributed, main_process_ip is the IP address of the machine with rank 0, so it should work if you enter that.

In Slurm, there is srun, which launches as many instances of the script as nodes × tasks per node (i.e. processes).

Then, from within the script, we can retrieve all the Slurm environment variables we need (in particular the master task and the (local) rank of each process), which is all that is necessary for dist.init_process_group in pure PyTorch DDP.
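
For reference, this is roughly the pattern we use in plain PyTorch DDP (just a sketch: the port is an arbitrary free one, and the way we derive the master address may differ in your setup):

```python
import os
import subprocess

import torch
import torch.distributed as dist

# srun starts one copy of this script per task; Slurm tells each copy who it is.
rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
local_rank = int(os.environ["SLURM_LOCALID"])  # rank within the node
world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks

# Use the first hostname in the job's node list as the rendezvous point.
master_addr = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]]
).decode().splitlines()[0]

dist.init_process_group(
    backend="nccl",
    init_method=f"tcp://{master_addr}:29500",  # 29500 is an arbitrary free port
    rank=rank,
    world_size=world_size,
)
torch.cuda.set_device(local_rank)
```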

The problem I see is that I can't know in advance which nodes (among the hundreds available) my script will run on, so I can't fully build the config.json file in advance.
That is why I would like to be able to update main_process_ip and machine_rank from within the script.

I don't know if that makes sense (it is also how we use Horovod, i.e. the setup happens inside the script).

It doesn’t look like the Accelerate launcher can help you here, but there is no problem using your usual launching script.

Ok, thanks, that is probably easier.
But then how can I let Accelerate know which node is the master?
Is there some argument to Accelerator() that can be set?

It detects everything from the environment (like PyTorch DDP) so you shouldn’t have to worry about anything.
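
Concretely (a rough sketch, and the exact variables read may depend on the version): as long as each process launched by srun exposes the usual torch.distributed environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) before the Accelerator is created, the rest should be picked up automatically, e.g.:

```python
import os

from accelerate import Accelerator

# Map the Slurm variables onto the environment variables that
# torch.distributed (and hence Accelerate) reads when it initializes itself.
os.environ["RANK"] = os.environ["SLURM_PROCID"]         # global rank of this task
os.environ["LOCAL_RANK"] = os.environ["SLURM_LOCALID"]  # rank within the node
os.environ["WORLD_SIZE"] = os.environ["SLURM_NTASKS"]   # total number of tasks
# MASTER_ADDR and MASTER_PORT would typically be exported from the sbatch
# script (e.g. the first hostname in $SLURM_JOB_NODELIST and any free port).

accelerator = Accelerator()  # picks the distributed setup up from the environment
print(f"process {accelerator.process_index} of {accelerator.num_processes}")
```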

Ok, I will give it a try. It already works on one node with several GPUs, so I only have one more step to overcome.
Thanks for your answers!