Run training script in DDP using GLOO

Hello,

I am desperately trying to run my training script in DDP using GLOO (on Windows). I managed to do it on a single machine with 2 GPUs by running the following command:

launch.py
--nproc_per_node=2
my_script.py

As NCCL is not available on Windows, I had to tweak the 'setup_devices' method of 'training_args.py' and write:

torch.distributed.init_process_group(backend="nccl") → torch.distributed.init_process_group(backend="gloo")

along with the 'distributed_concat' function in 'trainer_pt_utils.py':

dist.all_gather(output_tensors, tensor) → dist.all_gather(output_tensors, tensor if len(tensor.shape) > 0 else tensor[None])
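
(The reason for that second tweak, as far as I can tell: Gloo's all_gather cannot handle 0-dimensional tensors, so scalars such as a loss value have to be promoted to 1-element tensors first. A quick illustration:)

import torch

t = torch.tensor(3.14)   # 0-d scalar tensor, shape torch.Size([])
print(len(t.shape))      # 0 -> this is the case the tweak guards against
print(t[None].shape)     # torch.Size([1]) -> 1-element tensor, safe to gather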

With those two changes I can run my script using DDP and the GLOO backend on my 2-GPU Windows machine.
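
(For reference, instead of hard-coding the backend, a less invasive variant of the first tweak would pick it based on availability; a minimal sketch, not what the library actually does:)

import torch.distributed as dist

# fall back to Gloo where NCCL is unavailable (e.g. on Windows)
backend = "nccl" if dist.is_nccl_available() else "gloo"
dist.init_process_group(backend=backend)
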
Following that, I am now trying to run the script on 2 Windows machines with 2 GPUs each. From what I understand, I need to run the same script on both machines, so I used the following commands on the two machines:

…torch\distributed\launch.py
--nproc_per_node=2
--nnodes=2
--node_rank=0
--master_addr="10.73.8.68"
--master_port=1234
my_script.py

and

…torch\distributed\launch.py
--nproc_per_node=2
--nnodes=2
--node_rank=1
--master_addr="10.73.8.68"
--master_port=1234
my_script.py
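
(For anyone trying to reproduce this, a minimal script launched with the same two commands above should be enough to test the multi-node rendezvous on its own; smoke_test.py below is a hypothetical stand-in for my_script.py:)

# smoke_test.py -- hypothetical stand-in for my_script.py, just to test the rendezvous
import torch
import torch.distributed as dist

# reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE set by launch.py
dist.init_process_group(backend="gloo")
rank = dist.get_rank()
t = torch.ones(1) * rank
dist.all_reduce(t)  # with 4 processes this sums ranks 0+1+2+3 = 6
print(f"rank {rank}: all_reduce gave {t.item()}")
dist.destroy_process_group()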

When doing so, the script starts executing on both machines (loading the dataset on both sides), but then nothing happens and the processes don't seem to get past the instruction:

torch.distributed.init_process_group(backend="gloo")

in training_args.py; it just never returns…
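
(One thing that might help narrow it down: init_process_group blocks for a default 30-minute timeout, so passing a shorter one should at least make the failure surface quickly. A sketch of that debugging tweak:)

from datetime import timedelta
import torch.distributed as dist

# fail fast instead of blocking for the default 30 minutes at rendezvous
dist.init_process_group(backend="gloo", timeout=timedelta(seconds=60))
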
Do you have any idea whether what I am doing makes sense, and how I should run my script in DDP on 2 Windows machines with 2 GPUs each (using the GLOO backend)?

Thank you.


Did you manage to solve your issue? Mind sharing? 🙂