How to calculate the effective batch size on TPU?

When training on a single GPU, the effective batch size is the batch size multiplied by the number of gradient accumulation steps.

When multiple GPUs are used, we have to multiply the number of GPUs, the batch size, and the gradient accumulation steps to get the effective batch size.

Is it the same for TPU? When I use 8 TPU cores instead of 1, does the effective batch size equal 8 times the batch size times the gradient accumulation steps? Or does the batch size get divided equally across the 8 TPU cores?

As an example, suppose I give the trainer a batch size of 32, gradient accumulation steps of 1, and 8 TPU cores. Is the effective batch size 32 or 256?


Using 8 TPU cores works exactly the same as using 8 GPUs, so the effective batch size is 256.
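
A minimal sketch of the arithmetic, assuming the per-device batch size semantics described above (the function name and parameter names are illustrative, not part of any library API; with the Hugging Face Trainer they correspond to `per_device_train_batch_size`, `gradient_accumulation_steps`, and the number of GPUs/TPU cores in use):

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int) -> int:
    """Number of samples contributing to a single optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

# The example from the question: batch size 32, no accumulation, 8 TPU cores.
print(effective_batch_size(32, 1, 8))  # 256
```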


Thanks for clarifying.