How do you usually distribute in multi-node with slurm?
In PyTorch distributed, `main_process_ip` (the equivalent of `MASTER_ADDR`) is the IP address of the rank-0 machine, so it should work if you enter that address.
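Under SLURM you usually don't know the rank-0 node in advance, so one common approach is to derive it from the job's node list inside the batch script. A minimal sketch (assuming the script runs inside an `sbatch` allocation; the `MAIN_HOST`/`MAIN_IP` names are illustrative, and the `localhost` branch is only a fallback so the sketch runs outside SLURM):

```shell
#!/bin/sh
# SLURM_JOB_NODELIST holds the allocation's compressed node list, e.g. "node[01-04]".
# `scontrol show hostnames` expands it; the first hostname is taken as the rank-0 machine.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    MAIN_HOST=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
else
    MAIN_HOST=localhost   # fallback so the sketch is runnable outside a SLURM job
fi

# Resolve the hostname to an IP address to pass as main_process_ip / MASTER_ADDR.
MAIN_IP=$(getent hosts "$MAIN_HOST" | awk '{ print $1; exit }')
echo "$MAIN_IP"
```

The printed address is what you would hand to your launcher (for example as `--main_process_ip` or exported as `MASTER_ADDR`), together with a free port, on every node of the job.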