In Slurm, srun launches as many instances of the script as there are nodes × tasks (i.e. processes).
Then, from within the script, we can retrieve all the Slurm environment variables that we need (specifically the address of the master task and the (local) rank of each process), which is all that is necessary for `dist.init_process_group` in pure PyTorch DDP.
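For reference, this is roughly what that looks like in pure PyTorch DDP (a minimal sketch; the variable names are the standard Slurm ones, and the master port here is an arbitrary choice):

```python
import os
import subprocess
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
local_rank = int(os.environ["SLURM_LOCALID"])  # rank within the node
world_size = int(os.environ["SLURM_NTASKS"])   # nodes x tasks-per-node

# Use the first hostname of the allocation as the master address
master_addr = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]]
).decode().splitlines()[0]
os.environ["MASTER_ADDR"] = master_addr
os.environ["MASTER_PORT"] = "29500"  # any free port

dist.init_process_group("nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)
```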
The problem I see is that I can't know in advance which nodes (among the hundreds available) my script will run on, so I can't fully build the config.json file in advance.
That is why I would like to be able to update main_process_ip and the machine rank from within the script.
I don't know if that makes sense (it is also how we use Horovod, i.e. the setup happens inside the script).
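For concreteness, what I have in mind is something like the sketch below. This assumes Accelerate will pick up the standard torch.distributed environment variables (MASTER_ADDR, RANK, LOCAL_RANK, WORLD_SIZE) when they are set before creating the Accelerator; I am not sure this is the officially supported path, and the port value is arbitrary:

```python
import os
import subprocess
from accelerate import Accelerator

# Derive the master address and ranks from Slurm at run time
master_addr = subprocess.check_output(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]]
).decode().splitlines()[0]

os.environ["MASTER_ADDR"] = master_addr           # would play the role of main_process_ip
os.environ["MASTER_PORT"] = "29500"
os.environ["RANK"] = os.environ["SLURM_PROCID"]   # would play the role of machine/process rank
os.environ["LOCAL_RANK"] = os.environ["SLURM_LOCALID"]
os.environ["WORLD_SIZE"] = os.environ["SLURM_NTASKS"]

accelerator = Accelerator()  # ideally no config.json needed for these fields
```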