If I have two GPUs and use device_map="auto", which by default splits the model evenly between them, how does setting per_device_train_batch_size affect the effective batch size? Specifically, is the effective batch size equal to per_device_train_batch_size, or is it 2 x per_device_train_batch_size? Is there a way to explicitly see the effective batch size?
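
For reference, here is roughly the setup I mean (a minimal sketch; the checkpoint name and batch-size values are just placeholders, and device_map="auto" assumes accelerate is installed):

from transformers import AutoModelForCausalLM, TrainingArguments

# Shard one copy of the model across both GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",             # placeholder checkpoint
    device_map="auto",  # splits the layers over the available GPUs
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,   # the setting in question
    gradient_accumulation_steps=1,
)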
I haven’t been able to find any materials that explicitly state the formula or a way to check it, but I think this is probably correct.
"or is it 2 x per_device_train_batch_size?"
So maybe this one.
# if using gradient accumulation
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# else
effective_batch_size = per_device_train_batch_size * num_gpus
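
To see the number explicitly, one option is to compute it from the fields TrainingArguments already exposes. A minimal sketch, assuming a recent transformers release (output_dir and the batch-size values are placeholders):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                 # placeholder
    per_device_train_batch_size=8,    # placeholder value
    gradient_accumulation_steps=2,    # placeholder value
)

# world_size is the number of training processes: 1 for a single-process run
# (which is what device_map="auto" model sharding uses), and >1 only when the
# script is launched with torchrun/accelerate for data parallelism.
effective_batch_size = (
    args.per_device_train_batch_size
    * args.gradient_accumulation_steps
    * args.world_size
)
print("effective batch size:", effective_batch_size)

If I remember correctly, the Trainer also logs a "Total train batch size (w. parallel, distributed & accumulation)" line at the start of training, which should match this value.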