Using Accelerate on an HPC (Slurm)

Hi,

I am performing some tests with Accelerate on an HPC cluster (where Slurm is usually how we distribute computation). It works on one node with multiple GPUs, but now I want to try a multi-node setup.

I will use your launcher accelerate launch --config_file <config-file> <my script>, but then I need to be able to update a couple of fields from the JSON config file inside my script (so during the creation of the Accelerator?):

  • main_process_ip
  • machine_rank

How can I do that? Will it work?

Am I right in thinking that if my setup is two nodes, each with 4 GPUs, the (range of) values should be:

  • for “num_processes”: 8 (the total number of GPUs)
  • for “num_machines”: 2
  • for “machine_rank”: [0, 1]
  • for “distributed_type”: “MULTI_GPU”
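
As a sanity check once the job runs, I was thinking of printing these values back from the Accelerator object itself. Here is a minimal sketch (assuming the attributes below are the right ones to look at):

```python
from accelerate import Accelerator

accelerator = Accelerator()

# Expected values for a 2-node x 4-GPU job:
print(accelerator.num_processes)           # 8
print(accelerator.process_index)           # 0..7, one per process/GPU
print(accelerator.state.distributed_type)  # DistributedType.MULTI_GPU
```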

Thanks


How do you usually distribute multi-node jobs with Slurm?
In PyTorch distributed, main_process_ip is the IP address of the machine with rank 0, so it should work if you enter that.

In Slurm, there is srun, which launches as many instances of the script as nodes × tasks (i.e. processes).

Then, from within the script, we can retrieve all the Slurm environment variables that we need (specifically the master task and the (local) rank of each process), which is all that is necessary for dist.init_process_group in pure PyTorch DDP.
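
For reference, here is roughly what that looks like for us in pure PyTorch DDP (just a sketch; the Slurm variable names and the port are what we typically use, nothing specific to Accelerate):

```python
import os
import subprocess

import torch.distributed as dist

def init_distributed_from_slurm(port=29500):
    # One process per task launched by srun.
    rank = int(os.environ["SLURM_PROCID"])        # global rank
    world_size = int(os.environ["SLURM_NTASKS"])  # total number of processes
    local_rank = int(os.environ["SLURM_LOCALID"]) # rank within the node

    # Take the first node of the allocation as the master.
    master_addr = subprocess.getoutput(
        "scontrol show hostnames $SLURM_JOB_NODELIST"
    ).split()[0]

    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_addr}:{port}",
        rank=rank,
        world_size=world_size,
    )
    return local_rank
```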

What I see as a problem is that I can’t know in advance which nodes (among the hundreds available) my script will run on, so I can’t fully build the config.json file in advance.
That is why I would like to be able to update main_process_ip and machine_rank from within the script.

I don’t know if that makes sense (it is also how we use Horovod, i.e. the setup happens inside the script).

It doesn’t look like the Accelerate launcher can help you here, but there is no problem using your usual launching script.

Ok thanks, that is probably easier.
But then how can I let Accelerate know which node is the master?
Is there some argument to Accelerator() that can be set?

It detects everything from the environment (like PyTorch DDP) so you shouldn’t have to worry about anything.
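
For example, here is a minimal sketch of what your launching script could export before creating the Accelerator (the Slurm variable names and the port below are assumptions for your setup; Accelerate should pick up the same RANK/LOCAL_RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables that torch.distributed uses):

```python
import os
import subprocess

from accelerate import Accelerator

# Map the Slurm environment onto the variables torch.distributed
# (and thus Accelerate) reads when initializing from the environment.
os.environ["RANK"] = os.environ["SLURM_PROCID"]
os.environ["LOCAL_RANK"] = os.environ["SLURM_LOCALID"]
os.environ["WORLD_SIZE"] = os.environ["SLURM_NTASKS"]
os.environ["MASTER_ADDR"] = subprocess.getoutput(
    "scontrol show hostnames $SLURM_JOB_NODELIST"
).split()[0]
os.environ.setdefault("MASTER_PORT", "29500")  # any free port, identical on all nodes

accelerator = Accelerator()  # detects the multi-GPU setup from the environment
```

Launched with srun so that there is one task per GPU, each process should then find its own rank and the master address without touching the config file.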


Ok, I will give it a try. It works on one node with several GPUs, so I only have one more step to overcome.
Thanks for your answers!

Hi, did you solve this problem?
I have the same problem as you.
I can’t know which machine/node I will get in advance, so how should I configure Accelerate?

Hello,
Sorry, I can’t help you right now. I was actually waiting for the tool to become more mature and/or better documented before trying again.
I will post in this thread if I find the time to resume my tests.

Don’t hesitate to post if you find a solution; I would be very interested to read it and try it!

Any update on this?

Hello @Dahoas,

A user seems to have provided an approach in this comment on a related issue: Using accelerate config for SLURM cluster with dist_url input · Issue #145 · huggingface/accelerate · GitHub

Here is the GitHub Gist from that comment: distributed dalle2 laion (github.com)

Could you please try it out and let us know if that works?