Which data parallel does trainer use? DP or DDP?

I try to search in the doc. But I didn’t find the answer anywhere.

Thank you

2 Likes

It depends on whether you launch your training script with `python` (in which case it will use DP) or `python -m torch.distributed.launch` (in which case it will use DDP).
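As a rough sketch of that rule (this is my own illustration, not the actual `transformers` code): `torch.distributed.launch` and `torchrun` export `RANK`/`LOCAL_RANK` into each worker's environment, while a plain `python train.py` launch does not, so the Trainer can tell how it was started:

```python
import os

def guess_parallel_mode(num_visible_gpus: int) -> str:
    """Simplified sketch (not the real transformers logic) of how the
    Trainer picks a data-parallel strategy at startup.

    torchrun / torch.distributed.launch set RANK and LOCAL_RANK in the
    environment of every worker process; a plain `python` launch does not.
    """
    launched_distributed = "RANK" in os.environ or "LOCAL_RANK" in os.environ
    if launched_distributed:
        return "DDP"      # one process per GPU, DistributedDataParallel
    if num_visible_gpus > 1:
        return "DP"       # single process, nn.DataParallel over all visible GPUs
    return "single"       # one GPU (or CPU), no data parallelism

# plain `python train.py` with two GPUs visible:
print(guess_parallel_mode(2))  # -> DP
```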

4 Likes

perhaps useful to you: Using Transformers with DistributedDataParallel — any examples?

3 Likes

I know this is a bit of an old thread, but I have a follow-up question. I’m creating a Trainer(), evaluating, training, and evaluating again. Here’s a snippet of my code:

```
import logging
import pprint

import wandb

trainer = Trainer(
    model=model,
    processing_class=tokenizer,
    args=pretraining_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)

logging.info("Evaluating before training…")
eval_metrics_before = trainer.evaluate()
wandb.log({f"eval_before/{k}": v for k, v in eval_metrics_before.items()})
pprint.pprint(eval_metrics_before)

logging.info("Beginning training…")
trainer.train()

logging.info("Finished training. Beginning final evaluation…")
eval_metrics_after = trainer.evaluate()
wandb.log({f"eval_after/{k}": v for k, v in eval_metrics_after.items()})
pprint.pprint(eval_metrics_after)
```

When I run with two GPUs and a model small enough to fit on each, I noticed that evaluation appears to use data parallelism across the two visible GPUs, but training does not. Do you know what might cause that, or how to fix it?

1 Like

Hmm… Have you tried launching it via accelerate or torchrun?

```
# single node, 2 GPUs
torchrun --nproc_per_node=2 train.py
# or
accelerate launch --num_processes=2 train.py
```

Accelerator selection

Yeah, I would’ve thought that launching with python would use DP across all visible GPUs. And that’s only partially what I see: train() indeed uses only 1 GPU, but evaluate() uses 2 GPUs. Hence my confusion…

1 Like

I see. When you run multi-GPU training as a single process, evaluate() can sometimes behave differently from train()… Since DP itself seems quite fragile, using DDP (i.e. launching via torchrun or accelerate) is probably the simpler approach…