Hi there! I'm doing DDP-style training with the Trainer. I've set up my accelerate config to use 4 GPUs, and I have set these args for training:
per_device_train_batch_size=2, gradient_accumulation_steps=16
I want to know what the effective batch size will be in this case.
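For reference, the relevant arguments look roughly like this (output_dir is just a placeholder; only the two batch-related args are from my actual run):

```python
from transformers import TrainingArguments

# Only the two batch-related args below are from my run; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=2,    # batch size on each of the 4 GPUs
    gradient_accumulation_steps=16,   # gradients accumulated before each optimizer step
)
```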
When I start the training it shows 24024 total steps, and there are 1025024 training examples in total. My intuition was that per_device_train_batch_size * gradient_accumulation_steps * 4 would be the effective batch size, which comes out to 128. But when I divide 1025024 by 128 I don't get 24024. Is there anything that I'm missing?
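To make the arithmetic concrete, here is the calculation I'm doing, as plain Python with the numbers from my run (nothing Trainer-specific):

```python
# Numbers from my run
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
num_gpus = 4                      # from my accelerate config
num_train_examples = 1025024
reported_total_steps = 24024      # what the Trainer shows when training starts

# What I assume the effective batch size per optimizer step is
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)       # 128

# Steps I would expect for one pass over the training set
expected_steps = num_train_examples / effective_batch_size
print(expected_steps)             # 8008.0, not the 24024 the Trainer reports
```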