Incorrect total train batch size when using tp_size > 1 and deepspeed

When I launch a job running run_clm.py on two 8-GPU nodes with tp_size = 2 and per_device_train_batch_size = 1 (both set via TrainingArguments) and DeepSpeed ZeRO-2 (the ZeRO stage does not actually seem to matter), I get the following log, which does not look correct:
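
For context, here is a simplified sketch of the relevant training arguments (everything else is left at its default; output_dir and the DeepSpeed config filename are placeholders, and the real job passes these as CLI flags to run_clm.py):

```python
from transformers import TrainingArguments

# Only the settings relevant to this issue are shown.
# "ds_config_zero2.json" stands in for my ZeRO-2 DeepSpeed config file.
training_args = TrainingArguments(
    output_dir="output",                # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    num_train_epochs=3,
    tp_size=2,                          # tensor parallelism across 2 GPUs
    deepspeed="ds_config_zero2.json",
)
```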

node032: [INFO|trainer.py:2414] 2025-05-20 22:25:24,210 >> ***** Running training *****
node032: [INFO|trainer.py:2415] 2025-05-20 22:25:24,210 >>   Num examples = 287
node032: [INFO|trainer.py:2416] 2025-05-20 22:25:24,210 >>   Num Epochs = 3
node032: [INFO|trainer.py:2417] 2025-05-20 22:25:24,210 >>   Instantaneous batch size per device = 1
node032: [INFO|trainer.py:2420] 2025-05-20 22:25:24,210 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
node032: [INFO|trainer.py:2421] 2025-05-20 22:25:24,210 >>   Gradient Accumulation steps = 1
node032: [INFO|trainer.py:2422] 2025-05-20 22:25:24,210 >>   Total optimization steps = 108
node032: [INFO|trainer.py:2423] 2025-05-20 22:25:24,211 >>   Number of trainable parameters = 3,085,938,688
node028: [INFO|trainer.py:2414] 2025-05-20 22:25:27,491 >> ***** Running training *****
node028: [INFO|trainer.py:2415] 2025-05-20 22:25:27,491 >>   Num examples = 287
node028: [INFO|trainer.py:2416] 2025-05-20 22:25:27,491 >>   Num Epochs = 3
node028: [INFO|trainer.py:2417] 2025-05-20 22:25:27,491 >>   Instantaneous batch size per device = 1
node028: [INFO|trainer.py:2420] 2025-05-20 22:25:27,491 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
node028: [INFO|trainer.py:2421] 2025-05-20 22:25:27,491 >>   Gradient Accumulation steps = 1
node028: [INFO|trainer.py:2422] 2025-05-20 22:25:27,491 >>   Total optimization steps = 108
node028: [INFO|trainer.py:2423] 2025-05-20 22:25:27,492 >>   Number of trainable parameters = 3,085,938,688

Based on my understanding, with 16 GPUs in total, tp_size = 2, and per_device_train_batch_size = 1, there should be 16 / 2 = 8 data-parallel replicas, so the total train batch size should be 8 rather than 16. Seeing 16 in the log is quite strange to me.
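
To spell out the calculation I have in mind (just restating my reasoning above, nothing library-specific):

```python
# Expected total train batch size when GPUs in the same tensor-parallel
# group share the same micro-batch, so only data-parallel replicas count.
world_size = 16                          # 2 nodes x 8 GPUs
tp_size = 2                              # tensor-parallel group size
per_device_train_batch_size = 1
gradient_accumulation_steps = 1

dp_size = world_size // tp_size          # 8 data-parallel replicas
expected_total = dp_size * per_device_train_batch_size * gradient_accumulation_steps
print(expected_total)                    # 8, but the Trainer log reports 16
```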

The training itself runs without errors, though.

How does this happen? Is there anything I might be doing wrong?

My environment:

transformers==4.51.3
accelerate==1.6.0
deepspeed==0.16.7
torch==2.6.0+cu124

Different but similar issue.