Distributed Training w/ Trainer

Does anyone have an end-to-end example of how to do multi-GPU, multi-node distributed training using the Trainer? I can’t seem to find one anywhere.


All the examples using the Trainer work in a multi-GPU, multi-node setting; you just have to use the PyTorch launcher to properly launch a multi-GPU, multi-node training.


So there are no code adjustments that need to be made, only a change in how the file is launched?

Yes, the Trainer will deal with all the rest by itself.

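For reference, a minimal sketch of what such a script can look like (the checkpoint and dataset names below are placeholders, not something from this thread). The same file is launched unchanged on every node with the PyTorch launcher, and the Trainer picks up the distributed configuration from the environment the launcher sets up:

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder checkpoint and dataset; swap in your own.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

# per_device_train_batch_size is per GPU: the launcher starts one process per GPU.
args = TrainingArguments(output_dir="out", per_device_train_batch_size=8)

trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"])
trainer.train()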

Hi, I’m trying to run a multi-node training using the Trainer class. For that I launch my script with python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr="IP" --master_port=1234; however, the script doesn’t wait for the master node. Also, when I run it on the master node, the script doesn’t wait for the child node. Should I set any environment variables? The only thing I’m doing is passing the local_rank to the TrainingArguments.

Thanks for the help!

It’s hard to know what the problem could be without seeing the script you are launching.
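
One way to sanity-check the rendezvous independently of the Trainer is a bare all_reduce launched with the same launcher arguments on both nodes. A minimal sketch, assuming the NCCL backend and a LOCAL_RANK environment variable (torchrun sets it; the older torch.distributed.launch passes --local_rank as an argument instead):

import os
import torch
import torch.distributed as dist

# With the default env:// method, init_process_group reads MASTER_ADDR, MASTER_PORT,
# RANK and WORLD_SIZE, which the launcher sets for every process it spawns.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# If both nodes reach the rendezvous, every rank prints the same summed value,
# equal to the total number of processes across all nodes.
t = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(t)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} sees {t.item()}")

dist.destroy_process_group()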

What do you mean by “use the PyTorch launcher to properly launch a multi-GPU, multi-node training”?


It doesn’t work with Longformer. Is this expected?

https://pytorch.org/docs/stable/distributed.html#launch-utility

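Concretely, the launcher starts one process per GPU on each node and hands each process its coordinates. A small sketch of what each spawned process receives (torchrun, and torch.distributed.launch with --use_env, export these as environment variables; the older launcher passes --local_rank as a command-line argument instead):

import os
import sys

# RANK is the global index across all nodes, LOCAL_RANK the index on this node,
# and WORLD_SIZE the total number of processes.
print(
    "argv:", sys.argv[1:],
    "RANK:", os.environ.get("RANK"),
    "LOCAL_RANK:", os.environ.get("LOCAL_RANK"),
    "WORLD_SIZE:", os.environ.get("WORLD_SIZE"),
)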

Are there any examples of running a distributed training job using PyTorch on k8s, specifically GKE?


Use Ray for distributed training. Here is a guide put up by verl for multi-node training in Ray.


If you want to do multi-node, multi-GPU training, make sure to use the local_process_index instead of the process_index. Otherwise the index keeps counting across nodes, so on the second node it no longer maps to a valid local GPU.

from accelerate import PartialState
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# local_process_index is this process's rank on its own node, so it maps
# directly to a local GPU id even when training spans several nodes.
device_string = PartialState().local_process_index

model = AutoModelForCausalLM.from_pretrained(
    ...,  # model name or path
    device_map={'': device_string}
)
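
To see the difference, a short sketch (assuming a hypothetical 2-node, 8-GPU-per-node run): under the launcher, process_index on the second node runs 8–15 while local_process_index runs 0–7, and only the latter is a valid CUDA device id on that node.

from accelerate import PartialState

state = PartialState()
# process_index is global across nodes; local_process_index restarts at 0 on each node.
print(f"process_index={state.process_index} local_process_index={state.local_process_index}")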