Which data parallel does trainer use? DP or DDP?

I try to search in the doc. But I didn’t find the answer anywhere.

Thank you

2 Likes

It depends on whether you launch your training script with `python` (in which case it will use DP) or `python -m torch.distributed.launch` (in which case it will use DDP).
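As a rough sketch of that rule (this is my own illustration, not the actual `transformers` code): `torch.distributed.launch` and `torchrun` export `RANK`/`LOCAL_RANK` into each worker's environment, while a plain `python train.py` launch does not, so the Trainer can tell how it was started:

```python
import os

def guess_parallel_mode(num_visible_gpus: int) -> str:
    """Simplified sketch (not the real transformers logic) of how the
    Trainer picks a data-parallel strategy at startup.

    torchrun / torch.distributed.launch set RANK and LOCAL_RANK in the
    environment of every worker process; a plain `python` launch does not.
    """
    launched_distributed = "RANK" in os.environ or "LOCAL_RANK" in os.environ
    if launched_distributed:
        return "DDP"      # one process per GPU, DistributedDataParallel
    if num_visible_gpus > 1:
        return "DP"       # single process, nn.DataParallel over all visible GPUs
    return "single"       # one GPU (or CPU), no data parallelism

# plain `python train.py` with two GPUs visible:
print(guess_parallel_mode(2))  # -> DP
```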

4 Likes

perhaps useful to you: Using Transformers with DistributedDataParallel — any examples?

3 Likes

I know this is a bit of an old thread, but I have a follow-up question. I’m creating a Trainer(), evaluating, training, and evaluating again. Here’s a snippet of my code:

```
import logging
import pprint

import wandb

trainer = Trainer(
    model=model,
    processing_class=tokenizer,
    args=pretraining_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)

logging.info("Evaluating before training…")
eval_metrics_before = trainer.evaluate()
wandb.log({f"eval_before/{k}": v for k, v in eval_metrics_before.items()})
pprint.pprint(eval_metrics_before)

logging.info("Beginning training…")
trainer.train()

logging.info("Finished training. Beginning final evaluation…")
eval_metrics_after = trainer.evaluate()
wandb.log({f"eval_after/{k}": v for k, v in eval_metrics_after.items()})
pprint.pprint(eval_metrics_after)
```

When I run with two GPUs and a model small enough to fit on each, I noticed that evaluation appears to use data parallelism across the two visible GPUs, but training does not. Do you know what might cause that, or how to fix it?

1 Like

Hmm… Have you tried launching it via accelerate or torchrun?

```
# single node, 2 GPUs
torchrun --nproc_per_node=2 train.py
# or
accelerate launch --num_processes=2 train.py
```

Accelerator selection

Yeah, I would’ve thought that launching with python would use DP across all visible GPUs. And that’s only partially what I see: train() indeed uses only 1 GPU, but evaluate() uses 2 GPUs. Hence my confusion…

1 Like

I see. When you run multi-GPU training as a single process, evaluate() can sometimes behave differently from train()… Since DP itself seems quite fragile, using DDP (i.e. launching via torchrun or accelerate) is probably the simpler approach…