Per_device_train_batch_size in model parallelism

If I have two GPUs and use device_map="auto", so that by default the model is split evenly between them, how does setting per_device_train_batch_size affect the effective batch size? Specifically, is the effective batch size equal to per_device_train_batch_size, or is it 2 x per_device_train_batch_size? Is there a way to explicitly see the effective batch size?
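For reference, a rough sketch of the setup I mean (the model name and batch size are just placeholders):

from transformers import AutoModelForCausalLM, TrainingArguments

# Model weights are split across the two GPUs by device_map="auto".
model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="auto")

# The setting whose effect on the effective batch size I'm asking about.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
)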


I haven’t been able to find any documentation that spells out the exact formula or a way to check it, but I think the following is probably correct.

or is it 2 x per_device_train_batch_size

So maybe it’s this one:

# With gradient accumulation:
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Without gradient accumulation (gradient_accumulation_steps == 1):
effective_batch_size = per_device_train_batch_size * num_gpus
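
To check it explicitly, something like this sketch should work (the concrete numbers are placeholders). As far as I know, TrainingArguments exposes train_batch_size (per-device size times the number of GPUs seen by the process) and world_size as properties, so you can combine them with gradient_accumulation_steps:

from transformers import TrainingArguments

# Placeholder values for illustration only.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
)

# train_batch_size is a property: per_device_train_batch_size * max(1, n_gpu)
# world_size is the number of distributed processes (1 when not using DDP)
effective_batch_size = (
    args.train_batch_size * args.gradient_accumulation_steps * args.world_size
)
print("per-device batch size:", args.per_device_train_batch_size)
print("effective batch size:", effective_batch_size)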
