DDP running out of memory but DP is successful for the same per_device_train_batch_size

I am using Trainer to fine-tune a Seq2Seq model on a single node with 8 A100 GPUs.
I start the DDP training with torchrun --nproc_per_node 8 <training_file.py>
per_device_train_batch_size is set to batch_size//num_gpus.
For a batch size of 128, this gives a per-device batch size of 16 (I confirmed through logging that num_gpus is 8).
Even so, one of the GPUs runs out of memory roughly 20% of the way through training:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.42 GiB (GPU 7; 79.21 GiB total capacity; 57.13 GiB already allocated; 576.56 MiB free; 61.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
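
To make the figures in that message easier to compare, here is a small sketch of the arithmetic. The values are copied from the error text; the variable names are my own, and the reading of the remainder (memory held outside this process's PyTorch allocator) is an assumption on my part, not something the message states.

```python
# Figures copied from the OOM message for GPU 7.
total_gib = 79.21           # total capacity
allocated_gib = 57.13       # already allocated by PyTorch in this process
reserved_gib = 61.86        # reserved by PyTorch in this process
free_gib = 576.56 / 1024    # 576.56 MiB free, converted to GiB
requested_gib = 3.42        # size of the allocation that failed

# Reserved by the caching allocator but not backing a live tensor.
cached_unused_gib = reserved_gib - allocated_gib

# Neither free nor reserved by this process's allocator.
# Assumption: CUDA context overhead plus anything other processes hold on this GPU.
outside_allocator_gib = total_gib - reserved_gib - free_gib

print(f"cached but unused:      {cached_unused_gib:.2f} GiB")
print(f"outside this allocator: {outside_allocator_gib:.2f} GiB")
print(f"request fits in free memory: {requested_gib <= free_gib}")
```

If this reading is right, the gap between reserved and allocated memory is fairly small compared with the part of the GPU that this process does not account for at all.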

If I instead start the training with plain python <training_file.py>, I would expect it to use DP rather than DDP, based on the Hugging Face documentation. I keep per_device_train_batch_size at batch_size//num_gpus, and the training completes successfully. However, most of the memory usage is on a single GPU while the other 7 GPUs show very low usage, and I do not understand this behaviour for DP.
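
To rule out confusion about which mode is actually active, a check like the following could be added near the top of the training script. It is only a sketch: describe_launch is a hypothetical helper name, and it relies on torchrun exporting RANK, LOCAL_RANK and WORLD_SIZE to each worker, which a plain python launch normally leaves unset.

```python
import os


def describe_launch(env=None):
    """Describe how the script appears to have been launched.

    Reads the environment variables that torchrun sets for each worker.
    `env` can be passed explicitly for testing; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    local_rank = int(env.get("LOCAL_RANK", "-1"))
    if local_rank >= 0 and world_size > 1:
        return f"distributed: local_rank={local_rank}, world_size={world_size}"
    return "single process (no torchrun environment variables found)"


print(describe_launch())
```

Under torchrun --nproc_per_node 8 this should print one "distributed" line per worker, and a single "single process" line when started with python.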

Why am I getting out-of-memory errors with DDP but not with DP?

Code:

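from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# model_checkpoint, cache_dir and datatype are set elsewhere in the script (not shown).
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, cache_dir=cache_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint,
    cache_dir=cache_dir,
    torch_dtype=datatype,
    device_map="auto",
)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)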

args = Seq2SeqTrainingArguments(
    output_dir="/out_dir",
    evaluation_strategy="epoch",
    learning_rate=learning_rate,
    # batch_size is 128 and num_gpus is 8, so both per-device sizes are 16.
    per_device_train_batch_size=batch_size // num_gpus,
    per_device_eval_batch_size=batch_size // num_gpus,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=num_train_epochs,
    gradient_accumulation_steps=4,
    predict_with_generate=True,
    logging_steps=logging_steps,
    ddp_find_unused_parameters=False,
    push_to_hub=False,
)
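
For reference, this is how I understand these arguments to combine under DDP. It is a sketch of the arithmetic only; effective_batch_size is a hypothetical helper, not part of transformers.

```python
def effective_batch_size(per_device, num_devices, grad_accum_steps):
    """Number of samples that contribute to one optimizer step."""
    return per_device * num_devices * grad_accum_steps


batch_size = 128
num_gpus = 8
per_device = batch_size // num_gpus  # 16, as passed to Seq2SeqTrainingArguments

# 16 samples per GPU per forward/backward pass, 8 GPUs, 4 accumulation steps.
print(effective_batch_size(per_device, num_gpus, 4))  # 512
```

As far as I can tell, only the per-device value of 16 should matter for the activation memory of a single forward/backward pass, since accumulation runs the micro-batches one after another.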

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    # Evaluate on only the first 32 validation examples.
    eval_dataset=tokenized_datasets["validation"].select(range(32)),
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)