When training on a single GPU, the effective batch size is the batch size multiplied by the number of gradient accumulation steps.
When multiple GPUs are used, we have to multiply the number of GPUs, the per-device batch size, and the gradient accumulation steps to get the effective batch size.
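Just to make sure I have the GPU arithmetic right, here is a small sketch of my understanding (the function name is my own, not from the Trainer API):

```python
def effective_batch_size(per_device_batch_size: int,
                         grad_accum_steps: int,
                         num_devices: int = 1) -> int:
    """Effective batch size under data parallelism: each device
    processes its own batch, and gradients are accumulated for
    grad_accum_steps micro-batches before each optimizer step."""
    return per_device_batch_size * grad_accum_steps * num_devices

# Single GPU: batch size 32, accumulation 2
print(effective_batch_size(32, 2))                  # 64
# 4 GPUs: batch size 32, accumulation 2
print(effective_batch_size(32, 2, num_devices=4))   # 256
```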
Is it the same for TPUs? When I use 8 TPU cores instead of 1, does the effective batch size equal 8 times the batch size times the gradient accumulation steps? Or does the batch size get divided equally across the 8 TPU cores?
As an example, suppose I give the Trainer a batch size of 32, gradient accumulation steps of 1, and 8 TPU cores. Is the effective batch size 32 or 256?