Using Accelerate on an HPC (Slurm)

Hi,

I am performing some tests with Accelerate on an HPC cluster (where Slurm is usually how we distribute computation). It works on one node with multiple GPUs, but now I want to try a multi-node setup.

I will use your launcher accelerate launch --config_file <config-file> <my script>, but then I need to be able to update a couple of fields from the JSON config file inside my script (so during the creation of the Accelerator?):

  • main_process_ip
  • machine_rank

How can I do that? Will it work?

Am I right in thinking that if my setup is two nodes, each with 4 GPUs, the (range of) values should be:

  • for “num_processes”: 8 (the total number of GPUs)
  • for “num_machines”: 2
  • for “machine_rank”: [0, 1]
  • for “distributed_type”: “MULTI_GPU”
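
As a sanity check once the job runs, I was thinking of printing these values back from the Accelerator object itself. Here is a minimal sketch (assuming the attributes below are the right ones to look at):

```python
from accelerate import Accelerator

accelerator = Accelerator()

# Expected values for a 2-node x 4-GPU job:
print(accelerator.num_processes)           # 8
print(accelerator.process_index)           # 0..7, one per process/GPU
print(accelerator.state.distributed_type)  # DistributedType.MULTI_GPU
```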

Thanks


How do you usually distribute multi-node jobs with Slurm?
In PyTorch distributed, main_process_ip is the IP address of the machine with rank 0, so it should work if you enter that.

In Slurm, there is srun, which launches as many instances of the script as nodes × tasks (i.e. processes).

Then, from within the script, we can retrieve all the Slurm environment variables that we need (specifically the master task and the (local) rank of each process), which is all that is necessary for dist.init_process_group in pure PyTorch DDP.
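
For reference, here is roughly what that looks like for us in pure PyTorch DDP (just a sketch; the Slurm variable names and the port are what we typically use, nothing specific to Accelerate):

```python
import os
import subprocess

import torch.distributed as dist

def init_distributed_from_slurm(port=29500):
    # One process per task launched by srun.
    rank = int(os.environ["SLURM_PROCID"])        # global rank
    world_size = int(os.environ["SLURM_NTASKS"])  # total number of processes
    local_rank = int(os.environ["SLURM_LOCALID"]) # rank within the node

    # Take the first node of the allocation as the master.
    master_addr = subprocess.getoutput(
        "scontrol show hostnames $SLURM_JOB_NODELIST"
    ).split()[0]

    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_addr}:{port}",
        rank=rank,
        world_size=world_size,
    )
    return local_rank
```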

What I see as a problem is that I can’t know in advance which nodes (among the hundreds available) my script will run on, so I can’t fully build the config.json file in advance.
That is why I would like to be able to update main_process_ip and machine_rank from within the script.

I don’t know if that makes sense (it is also how we use Horovod, i.e. the setup happens inside the script).

It doesn’t look like the Accelerate launcher can help you here, but there is no problem using your usual launching script.

Ok thanks, that is probably easier.
But then how can I let Accelerate know which node is the master?
Is there some argument to Accelerator() that can be set?

It detects everything from the environment (like PyTorch DDP) so you shouldn’t have to worry about anything.
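
For example, here is a minimal sketch of what your launching script could export before creating the Accelerator (the Slurm variable names and the port below are assumptions for your setup; Accelerate should pick up the same RANK/LOCAL_RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables that torch.distributed uses):

```python
import os
import subprocess

from accelerate import Accelerator

# Map the Slurm environment onto the variables torch.distributed
# (and thus Accelerate) reads when initializing from the environment.
os.environ["RANK"] = os.environ["SLURM_PROCID"]
os.environ["LOCAL_RANK"] = os.environ["SLURM_LOCALID"]
os.environ["WORLD_SIZE"] = os.environ["SLURM_NTASKS"]
os.environ["MASTER_ADDR"] = subprocess.getoutput(
    "scontrol show hostnames $SLURM_JOB_NODELIST"
).split()[0]
os.environ.setdefault("MASTER_PORT", "29500")  # any free port, identical on all nodes

accelerator = Accelerator()  # detects the multi-GPU setup from the environment
```

Launched with srun so that there is one task per GPU, each process should then find its own rank and the master address without touching the config file.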


Ok, I will give it a try. It works on one node with several GPUs, so I only have one more step to overcome.
Thanks for your answers!

Hi, did you solve this problem?
I have the same problem as you.
I can’t know which machine/node I will get in advance, so how should I configure Accelerate?

Hello,
Sorry, I can’t help you right now. I was actually waiting for the tool to become more mature and/or better documented before trying again.
I will post in this thread if I find the time to resume my tests.

Don’t hesitate to post if you find a solution; I would be very interested to read it and try it!

Any update on this?

Hello @Dahoas,

A user seems to have provided an approach in this comment on a related issue: Using accelerate config for SLURM cluster with dist_url input · Issue #145 · huggingface/accelerate · GitHub

Here is the GitHub Gist from that comment: distributed dalle2 laion (github.com)

Could you please try it out and let us know if that works?