In Slurm, srun launches as many instances of the script as there are nodes × tasks (i.e. processes).
Then, from within the script, we can retrieve all the Slurm environment variables that we need (specifically the address of the master task and the (local) rank of each process), which is all that is necessary for `dist.init_process_group` in pure PyTorch DDP.
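For reference, this is roughly what that looks like in pure PyTorch DDP (a minimal sketch; the variable names are the standard Slurm ones, and the master port here is an arbitrary choice):

```python
import os
import subprocess
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
local_rank = int(os.environ["SLURM_LOCALID"])  # rank within the node
world_size = int(os.environ["SLURM_NTASKS"])   # nodes x tasks-per-node

# Use the first hostname of the allocation as the master address
master_addr = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]]
).decode().splitlines()[0]
os.environ["MASTER_ADDR"] = master_addr
os.environ["MASTER_PORT"] = "29500"  # any free port

dist.init_process_group("nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)
```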
The problem I see is that I can't know in advance which nodes (among the hundreds available) my script will run on, so I can't fully build the config.json file in advance.
That is why I would like to be able to update main_process_ip and the machine rank from within the script.
I don't know if that makes sense (it is also how we use Horovod, i.e. the setup happens inside the script).
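For concreteness, what I have in mind is something like the sketch below. This assumes Accelerate will pick up the standard torch.distributed environment variables (MASTER_ADDR, RANK, LOCAL_RANK, WORLD_SIZE) when they are set before creating the Accelerator; I am not sure this is the officially supported path, and the port value is arbitrary:

```python
import os
import subprocess
from accelerate import Accelerator

# Derive the master address and ranks from Slurm at run time
master_addr = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]]
).decode().splitlines()[0]

os.environ["MASTER_ADDR"] = master_addr           # would play the role of main_process_ip
os.environ["MASTER_PORT"] = "29500"
os.environ["RANK"] = os.environ["SLURM_PROCID"]   # would play the role of machine/process rank
os.environ["LOCAL_RANK"] = os.environ["SLURM_LOCALID"]
os.environ["WORLD_SIZE"] = os.environ["SLURM_NTASKS"]

accelerator = Accelerator()  # ideally no config.json needed for these fields
```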