Which data parallelism does the Trainer use: DP or DDP?

Hmm… have you tried launching it via accelerate or torchrun? The Trainer picks the strategy from how the script is launched: under a distributed launcher (one process per GPU) it uses DistributedDataParallel (DDP), while a plain single-process run on a multi-GPU machine falls back to torch.nn.DataParallel (DP).

# single node, 2 GPUs
torchrun --nproc_per_node=2 train.py
# or
accelerate launch --num_processes=2 train.py
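
If you want to confirm which strategy a given launch will pick, here is a minimal sketch (assuming a recent transformers release where TrainingArguments exposes the parallel_mode and n_gpu properties; the output_dir value and the script name check_parallel.py are placeholders):

# check_parallel.py -- print the data-parallel mode the Trainer would use
import torch
from transformers import TrainingArguments
from transformers.training_args import ParallelMode

args = TrainingArguments(output_dir="out")  # placeholder output dir

print(f"visible GPUs: {torch.cuda.device_count()}")
print(f"parallel_mode: {args.parallel_mode}")

if args.parallel_mode == ParallelMode.DISTRIBUTED:
    # launched via torchrun/accelerate: one process per GPU -> DDP
    print("Trainer will wrap the model in DistributedDataParallel")
elif args.parallel_mode == ParallelMode.NOT_DISTRIBUTED and args.n_gpu > 1:
    # plain `python check_parallel.py` with several visible GPUs -> DP
    print("Trainer will fall back to torch.nn.DataParallel")
else:
    print("single device, no data parallelism")

Running it once with plain python and once under torchrun --nproc_per_node=2 should show the mode switch from NOT_DISTRIBUTED to DISTRIBUTED.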

See also: "Accelerator selection" in the Transformers docs.