Distributed Training w/ Trainer

Does anyone have an end-to-end example of how to do multi-gpu, multi-node distributed training using the trainer? I can’t seem to find one anywhere.


All the examples using the Trainer work in multi-GPU, multi-node setups; you just have to use the PyTorch launcher to properly launch multi-GPU, multi-node training.

So no code adjustments need to be made, only to how the file is launched?

Yes, the Trainer will deal with all the rest by itself.
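For example, with two 8-GPU nodes you run the same launcher command on each node, changing only `--node_rank` (the script name `run_training.py`, the IP, and the port here are placeholders for your own values):

```shell
# On node 0 (the master node):
python -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr="192.168.1.1" --master_port=1234 \
    run_training.py

# On node 1, the identical command except for --node_rank:
python -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=2 --node_rank=1 \
    --master_addr="192.168.1.1" --master_port=1234 \
    run_training.py
```

The launcher spawns one process per GPU on each node and sets the environment variables the Trainer needs to initialize the distributed backend.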


Hi, I’m trying to run a multi-node training using the Trainer class. I launch my script with `python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr="IP" --master_port=1234`; however, the script doesn’t wait for the master node. Likewise, when I run it on the master node, the script doesn’t wait for the child node. Should I set up any environment variables? The only thing I’m doing is passing the local_rank to the TrainingArguments.

Thanks for the help!

It’s hard to know what the problem could be without seeing the script you are launching.
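One thing worth checking: `torch.distributed.launch` sets `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` for each process it spawns (and passes `--local_rank` to your script, or sets `LOCAL_RANK` when launched with `--use_env`), so you normally don’t need to export anything yourself. A minimal sketch to print what each process actually sees (the helper name `distributed_env` is mine, not part of any library):

```python
import os

def distributed_env():
    """Collect the environment variables that torch.distributed.launch
    is expected to set for each spawned process. The defaults here are
    what a single-process, single-node run would use."""
    return {
        "MASTER_ADDR": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "MASTER_PORT": os.environ.get("MASTER_PORT", "29500"),
        "RANK": int(os.environ.get("RANK", 0)),
        "WORLD_SIZE": int(os.environ.get("WORLD_SIZE", 1)),
        "LOCAL_RANK": int(os.environ.get("LOCAL_RANK", 0)),
    }

if __name__ == "__main__":
    # Print the values on every node/process; if RANK or WORLD_SIZE is
    # missing or wrong, the processes will not rendezvous and won't wait
    # for each other.
    for key, value in distributed_env().items():
        print(f"{key} = {value}")
```

If `WORLD_SIZE` doesn’t equal `nnodes * nproc_per_node` (16 in your command), or `MASTER_ADDR`/`MASTER_PORT` differ between the nodes, that would explain the processes not waiting for one another.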

what do you mean by “use the PyTorch launcher to properly launch a multi-GPU multinode training” ?

It doesn’t work with Longformer. Is this expected?

https://pytorch.org/docs/stable/distributed.html#launch-utility