I am trying to train a model on four GPUs (AWS ml.p3.8xlarge). As far as I can tell, to get my model to train with `DistributedDataParallel`, I only need to specify some integer value for `local_rank`. But my understanding is that this will only run the training on a single GPU (whichever one I specify with `local_rank`).
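For concreteness, here is a simplified sketch of the kind of training script I have in mind; the model and the tiny dummy dataset are just placeholders for my real ones:

```python
import argparse

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

parser = argparse.ArgumentParser()
# This is the integer I was referring to; it defaults to -1 (no distribution).
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

# Placeholder model and dummy dataset, only to make the sketch self-contained.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
train_dataset = Dataset.from_dict(
    {"input_ids": [[101, 2023, 102]] * 8, "labels": [0, 1] * 4}
)

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    local_rank=args.local_rank,  # pass the rank through to the Trainer
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```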
What is the proper way to launch `DistributedDataParallel` training across all four GPUs using a `Trainer`? Do I have to launch something via the command line (as hinted at here: https://github.com/huggingface/transformers/issues/1651)?
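Based on that issue, my guess is that the launch would look something like the command below (with `train_script.py` standing in for my actual script), but I don't know whether this is the recommended way when using a `Trainer`:

```bash
# Guess: use the PyTorch launcher to start one process per GPU.
# train_script.py is a placeholder for my actual training script.
python -m torch.distributed.launch --nproc_per_node=4 train_script.py
```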