Hello,
I am desperately trying to run my training script with DDP using GLOO (on Windows). I managed to do it on a single machine with 2 GPUs by running the following command:
launch.py --nproc_per_node=2 my_script.py
As NCCL is not available on Windows, I had to tweak the 'setup_devices' method of 'training_args.py' and write:
torch.distributed.init_process_group(backend="nccl") → torch.distributed.init_process_group(backend="gloo")
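Concretely, the only edit there is the backend string; the surrounding lines read roughly like this (paraphrased excerpt, so 'self' is the TrainingArguments instance and this is not a standalone snippet):

# training_args.py, inside the distributed branch of the device setup
# (paraphrased from memory; exact lines may differ between versions):
torch.distributed.init_process_group(backend="gloo")  # was: backend="nccl"
device = torch.device("cuda", self.local_rank)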
I also had to tweak the 'distributed_concat' function in 'trainer_pt_utils.py':
dist.all_gather(output_tensors, tensor) → dist.all_gather(output_tensors, tensor if len(tensor.shape) > 0 else tensor[None])
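For clarity, here is roughly what the patched helper ends up doing (a sketch, not the exact library code; I also promote the scalar before the clone so the gathered output tensors have matching shapes):

import torch
import torch.distributed as dist

def distributed_concat(tensor, num_total_examples=None):
    # Gloo's all_gather rejects 0-dim tensors, so promote scalars to
    # shape (1,) before gathering anything.
    tensor = tensor if len(tensor.shape) > 0 else tensor[None]
    # One output slot per rank, then gather and concatenate along dim 0.
    output_tensors = [tensor.clone() for _ in range(dist.get_world_size())]
    dist.all_gather(output_tensors, tensor)
    concat = torch.cat(output_tensors, dim=0)
    # The Trainer's sampler may pad the dataset; drop the padding here.
    if num_total_examples is not None:
        concat = concat[:num_total_examples]
    return concat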
With those two changes I can run my script with DDP and the GLOO backend on my 2-GPU Windows machine.
Following that, I am now trying to run the script on 2 Windows machines with 2 GPUs each. From what I understand, I need to run the same script on both machines, so I used the following command on the first machine:
…torch\distributed\launch.py --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="10.73.8.68" --master_port=1234 my_script.py
and this one on the second machine:
…torch\distributed\launch.py --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr="10.73.8.68" --master_port=1234 my_script.py
When doing so, the script starts executing on both machines (the dataset is loaded on both sides), but then nothing happens: the processes don't seem to get past the instruction
torch.distributed.init_process_group(backend="gloo")
in training_args.py and just never end…
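For what it's worth, I would expect even a bare-bones script like this one (nothing from transformers, just the rendezvous and a barrier) to show the same behaviour when launched with the two commands above:

import torch.distributed as dist

# Minimal multi-node rendezvous check. launch.py exports MASTER_ADDR,
# MASTER_PORT, RANK and WORLD_SIZE, so the default env:// init method
# of init_process_group picks them up automatically.
dist.init_process_group(backend="gloo")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} joined the group")
dist.barrier()  # all 4 ranks must get here, otherwise everyone blocks
print(f"rank {dist.get_rank()} passed the barrier")
dist.destroy_process_group()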
Do you have any idea whether what I am doing makes sense, and what I should do to run my script with DDP on 2 Windows machines with 2 GPUs each (using the GLOO backend)?
Thank you.