When training on a single GPU, the effective batch size is the batch size multiplied by the number of gradient accumulation steps.
When multiple GPUs are used, we have to multiply the number of GPUs, the per-device batch size, and the gradient accumulation steps to get the effective batch size.
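Just to make sure I have the GPU arithmetic right, here is a small sketch of my understanding (the function name is my own, not from the Trainer API):

```python
def effective_batch_size(per_device_batch_size: int,
                         grad_accum_steps: int,
                         num_devices: int = 1) -> int:
    """Effective batch size under data parallelism: each device
    processes its own batch, and gradients are accumulated for
    grad_accum_steps micro-batches before each optimizer step."""
    return per_device_batch_size * grad_accum_steps * num_devices

# Single GPU: batch size 32, accumulation 2
print(effective_batch_size(32, 2))                  # 64
# 4 GPUs: batch size 32, accumulation 2
print(effective_batch_size(32, 2, num_devices=4))   # 256
```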
Is it the same for TPUs? When I use 8 TPU cores instead of 1, does the effective batch size equal 8 times the batch size times the gradient accumulation steps? Or does the batch size get divided equally across the 8 TPU cores?
As an example, suppose I give the Trainer a batch size of 32, gradient accumulation steps of 1, and 8 TPU cores. Is the effective batch size 32 or 256?