Batch size vs gradient accumulation
I have a basic theoretical question: which option is better for the model and for GPU usage?

First option:

--per_device_train_batch_size 8 
--gradient_accumulation_steps 2

Second option:
--per_device_train_batch_size 16

If the second option does not OOM, it should give you better performance: each optimizer step needs only one forward/backward pass instead of two. Gradient accumulation (the first option) is a way to work around the out-of-memory error the second would give you.

Otherwise the two commands are equivalent: both yield an effective batch size of 8 × 2 = 16 per optimizer step, so the training dynamics are the same.
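To see why they are equivalent, here is a minimal PyTorch sketch (with a toy linear model and MSE loss as assumptions, not anything from the Trainer itself): accumulating two micro-batches of 8, with each mean-reduced loss divided by the number of accumulation steps, reproduces the gradient of a single batch of 16.

```python
import torch

torch.manual_seed(0)
x = torch.randn(16, 4)
y = torch.randn(16, 1)

def make_model():
    # Same seed so both models start from identical weights
    torch.manual_seed(1)
    return torch.nn.Linear(4, 1)

# Option 2: one full batch of 16
model_a = make_model()
loss = torch.nn.functional.mse_loss(model_a(x), y)
loss.backward()

# Option 1: two micro-batches of 8, accumulating gradients
# before what would be the optimizer step
model_b = make_model()
accum_steps = 2
for i in range(accum_steps):
    xb = x[i * 8:(i + 1) * 8]
    yb = y[i * 8:(i + 1) * 8]
    # Divide the mean loss by accum_steps so the summed gradients
    # match the mean over the full batch of 16
    micro_loss = torch.nn.functional.mse_loss(model_b(xb), yb) / accum_steps
    micro_loss.backward()  # .grad buffers accumulate across calls

print(torch.allclose(model_a.weight.grad, model_b.weight.grad, atol=1e-6))
```

The 1/`accum_steps` rescaling is the key detail; the HF Trainer applies it internally when `gradient_accumulation_steps` is set. Note the equivalence can break for batch-dependent layers such as BatchNorm, which see statistics of 8 samples instead of 16.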
