Hello,
I am desperately trying to run my training script with DDP using GLOO (on Windows). I managed to do it on a single machine with 2 GPUs by running the following command:
launch.py --nproc_per_node=2 my_script.py
As NCCL is not available on Windows, I had to tweak the 'setup_devices' method of 'training_args.py' and write:
torch.distributed.init_process_group(backend="nccl") → torch.distributed.init_process_group(backend="gloo")
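Concretely, the only edit there is the backend string; the surrounding lines read roughly like this (paraphrased excerpt, so 'self' is the TrainingArguments instance and this is not a standalone snippet):

# training_args.py, inside the distributed branch of the device setup
# (paraphrased from memory; exact lines may differ between versions):
torch.distributed.init_process_group(backend="gloo")  # was: backend="nccl"
device = torch.device("cuda", self.local_rank)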
I also had to tweak the 'distributed_concat' function in 'trainer_pt_utils.py':
dist.all_gather(output_tensors, tensor) → dist.all_gather(output_tensors, tensor if len(tensor.shape) > 0 else tensor[None])
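For clarity, here is roughly what the patched helper ends up doing (a sketch, not the exact library code; I also promote the scalar before the clone so the gathered output tensors have matching shapes):

import torch
import torch.distributed as dist

def distributed_concat(tensor, num_total_examples=None):
    # Gloo's all_gather rejects 0-dim tensors, so promote scalars to
    # shape (1,) before gathering anything.
    tensor = tensor if len(tensor.shape) > 0 else tensor[None]
    # One output slot per rank, then gather and concatenate along dim 0.
    output_tensors = [tensor.clone() for _ in range(dist.get_world_size())]
    dist.all_gather(output_tensors, tensor)
    concat = torch.cat(output_tensors, dim=0)
    # The Trainer's sampler may pad the dataset; drop the padding here.
    if num_total_examples is not None:
        concat = concat[:num_total_examples]
    return concat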
With those two changes I can run my script with DDP and the GLOO backend on my 2-GPU Windows machine.
Following that, I am now trying to run the script on 2 Windows machines with 2 GPUs each. From what I understand, I need to run the same script on both machines, so I used the following command on the first machine:
…torch\distributed\launch.py --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="10.73.8.68" --master_port=1234 my_script.py
and this one on the second machine:
…torch\distributed\launch.py --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr="10.73.8.68" --master_port=1234 my_script.py
When doing so, the script starts executing on both machines (the dataset is loaded on both sides), but then nothing happens: the processes don't seem to get past the instruction
torch.distributed.init_process_group(backend="gloo")
in training_args.py and just never end…
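For what it's worth, I would expect even a bare-bones script like this one (nothing from transformers, just the rendezvous and a barrier) to show the same behaviour when launched with the two commands above:

import torch.distributed as dist

# Minimal multi-node rendezvous check. launch.py exports MASTER_ADDR,
# MASTER_PORT, RANK and WORLD_SIZE, so the default env:// init method
# of init_process_group picks them up automatically.
dist.init_process_group(backend="gloo")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} joined the group")
dist.barrier()  # all 4 ranks must get here, otherwise everyone blocks
print(f"rank {dist.get_rank()} passed the barrier")
dist.destroy_process_group()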
Do you have any idea whether what I am doing makes sense, and what I should do to run my script with DDP on 2 Windows machines with 2 GPUs each (using the GLOO backend)?
Thank you.