Distributed Training w/ Trainer

Does anyone have an end-to-end example of how to do multi-GPU, multi-node distributed training using the Trainer? I can’t seem to find one anywhere.


All the examples using the Trainer work in a multi-GPU, multi-node setting; you just have to use the PyTorch launcher to properly launch a multi-GPU, multi-node training.


So there are no code adjustments that need to be made, only a change in how the file is launched?

Yes, the Trainer will deal with all the rest by itself.

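For reference, a minimal sketch of what such a script can look like (the checkpoint and dataset names below are placeholders, not something from this thread). The same file is launched unchanged on every node with the PyTorch launcher, and the Trainer picks up the distributed configuration from the environment the launcher sets up:

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder checkpoint and dataset; swap in your own.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

# per_device_train_batch_size is per GPU: the launcher starts one process per GPU.
args = TrainingArguments(output_dir="out", per_device_train_batch_size=8)

trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"])
trainer.train()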

Hi, I’m trying to run a multi-node training using the Trainer class. For that I launch my script with python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr="IP" --master_port=1234; however, the script doesn’t wait for the master node. Also, when I run it on the master node, the script doesn’t wait for the child node. Should I set any environment variables? The only thing I’m doing is passing the local_rank to the TrainingArguments.

Thanks for the help!

It’s hard to know what the problem could be without seeing the script you are launching.
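
One way to sanity-check the rendezvous independently of the Trainer is a bare all_reduce launched with the same launcher arguments on both nodes. A minimal sketch, assuming the NCCL backend and a LOCAL_RANK environment variable (torchrun sets it; the older torch.distributed.launch passes --local_rank as an argument instead):

import os
import torch
import torch.distributed as dist

# With the default env:// method, init_process_group reads MASTER_ADDR, MASTER_PORT,
# RANK and WORLD_SIZE, which the launcher sets for every process it spawns.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# If both nodes reach the rendezvous, every rank prints the same summed value,
# equal to the total number of processes across all nodes.
t = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(t)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} sees {t.item()}")

dist.destroy_process_group()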

What do you mean by “use the PyTorch launcher to properly launch a multi-GPU, multi-node training”?


It doesn’t work with Longformer. Is this expected?

https://pytorch.org/docs/stable/distributed.html#launch-utility

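Concretely, the launcher starts one process per GPU on each node and hands each process its coordinates. A small sketch of what each spawned process receives (torchrun, and torch.distributed.launch with --use_env, export these as environment variables; the older launcher passes --local_rank as a command-line argument instead):

import os
import sys

# RANK is the global index across all nodes, LOCAL_RANK the index on this node,
# and WORLD_SIZE the total number of processes.
print(
    "argv:", sys.argv[1:],
    "RANK:", os.environ.get("RANK"),
    "LOCAL_RANK:", os.environ.get("LOCAL_RANK"),
    "WORLD_SIZE:", os.environ.get("WORLD_SIZE"),
)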

Are there any examples of running a distributed training job using PyTorch on k8s, specifically GKE?


Use Ray for distributed training. Here is a guide put up by verl for multi-node training in Ray.


If you want to do multi-node, multi-GPU training, make sure to use the local_process_index instead of the process_index. Otherwise the index keeps counting across nodes, so on the second node it no longer maps to a valid local GPU.

from accelerate import PartialState
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# local_process_index is this process's rank on its own node, so it maps
# directly to a local GPU id even when training spans several nodes.
device_string = PartialState().local_process_index

model = AutoModelForCausalLM.from_pretrained(
    ...,  # model name or path
    device_map={'': device_string}
)
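
To see the difference, a short sketch (assuming a hypothetical 2-node, 8-GPU-per-node run): under the launcher, process_index on the second node runs 8–15 while local_process_index runs 0–7, and only the latter is a valid CUDA device id on that node.

from accelerate import PartialState

state = PartialState()
# process_index is global across nodes; local_process_index restarts at 0 on each node.
print(f"process_index={state.process_index} local_process_index={state.local_process_index}")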