Per_device_train_batch_size in model parallelism

If I have two GPUs and use device_map="auto", so that by default the model is split evenly between them, how does setting per_device_train_batch_size affect the effective batch size? Specifically, is the effective batch size equal to per_device_train_batch_size, or is it 2 x per_device_train_batch_size? Is there a way to explicitly see the effective batch size?
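For reference, a rough sketch of the setup I mean (the model name and batch size are just placeholders):

from transformers import AutoModelForCausalLM, TrainingArguments

# Model weights are split across the two GPUs by device_map="auto".
model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="auto")

# The setting whose effect on the effective batch size I'm asking about.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
)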


I haven’t been able to find any documentation that spells out the exact formula or a way to check it, but I think the following is probably correct.

or is it 2 x per_device_train_batch_size

So maybe it’s this one:

# With gradient accumulation:
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Without gradient accumulation (gradient_accumulation_steps == 1):
effective_batch_size = per_device_train_batch_size * num_gpus
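
To check it explicitly, something like this sketch should work (the concrete numbers are placeholders). As far as I know, TrainingArguments exposes train_batch_size (per-device size times the number of GPUs seen by the process) and world_size as properties, so you can combine them with gradient_accumulation_steps:

from transformers import TrainingArguments

# Placeholder values for illustration only.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
)

# train_batch_size is a property: per_device_train_batch_size * max(1, n_gpu)
# world_size is the number of distributed processes (1 when not using DDP)
effective_batch_size = (
    args.train_batch_size * args.gradient_accumulation_steps * args.world_size
)
print("per-device batch size:", args.per_device_train_batch_size)
print("effective batch size:", effective_batch_size)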
