Batch size vs gradient accumulation


I have a basic theoretical question: which option is better for the model and for GPU usage?

First option:

--per_device_train_batch_size 8 
--gradient_accumulation_steps 2

Second option:

--per_device_train_batch_size 16


If the second option does not run out of memory (OOM), you should get better performance with it. The first is a way to get around the memory error the second would otherwise give you.

Otherwise, the two commands are completely equivalent in terms of the training performed.
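The equivalence follows from simple arithmetic: the effective batch size per optimizer step is the product of the two flags (a quick check; the variable names below are just illustrative, not Trainer internals):

```python
# Effective batch size per optimizer step = micro-batch size * accumulation steps
per_device = 8
accum_steps = 2
effective = per_device * accum_steps
assert effective == 16  # same as --per_device_train_batch_size 16 alone
```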


Hi @sgugger

It would be better if you could supply some references on this point.

A source is not necessary for this, I think. The goal of gradient accumulation is precisely to overcome the memory constraints of the hardware.

So where does the performance difference between training with and without gradient accumulation come from, as @sgugger mentioned in his answer?

I am not sure it involves hardware only.

With gradient accumulation, you loop over the forward and backward passes (the number of iterations in the loop being the number of gradient accumulation steps). A for loop over the model is less efficient than feeding more data to it at once, as you are not taking full advantage of the parallelization your hardware can offer.

The only reason to use gradient accumulation steps is when your whole batch does not fit on one GPU, so you pay a price in speed to work around a memory limitation.
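To make the loop concrete, here is a minimal pure-Python sketch. The one-parameter linear model, hand-derived gradient, and function names are all hypothetical, not the Trainer's actual implementation; it only illustrates the accumulation loop and why, per optimizer step, the result matches a single full batch:

```python
# Hypothetical toy model: y ≈ w * x with mean-squared-error loss.
def grad_mse(w, pairs):
    # d/dw of mean((w*x - y)^2) over the pairs = mean(2*x*(w*x - y))
    return sum(2 * x * (w * x - y) for x, y in pairs) / len(pairs)

def train_step(w, batch, accum_steps=1, lr=0.1):
    micro = len(batch) // accum_steps
    grad = 0.0
    for i in range(accum_steps):  # the for loop gradient accumulation adds
        micro_batch = batch[i * micro:(i + 1) * micro]
        # Scale by 1/accum_steps so the summed gradient equals the
        # full-batch mean gradient.
        grad += grad_mse(w, micro_batch) / accum_steps
    return w - lr * grad  # one optimizer step per effective batch

batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
w_full = train_step(0.5, batch, accum_steps=1)   # one batch of 4
w_accum = train_step(0.5, batch, accum_steps=2)  # two micro-batches of 2
assert abs(w_full - w_accum) < 1e-9  # identical update either way
```

Per optimizer step the update is identical; the cost is that the (smaller) forward/backward pass runs `accum_steps` times in a loop, which is the speed penalty described above.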