I’m training bert-base on a single node with 8x A100 GPUs, using the run_mlm.py script.
When the batch size is set to 256, throughput is 8000 samples/s and GPU utilization is 80%.
When the batch size is set to 384, throughput drops to 4000 samples/s and GPU utilization falls to 50%.
What could cause this phenomenon? Has data IO become the bottleneck?
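One way I tried to check the IO hypothesis: time how long each training step waits on the dataloader versus computing. This is a minimal, hedged sketch; `fake_loader` is a stand-in for the real DataLoader used by run_mlm.py, and the threshold is just a rough heuristic.

```python
import time

def profile_loader(loader, steps=20):
    """Measure the fraction of loop time spent waiting for batches.
    A fraction near 1.0 suggests the loader (data IO) is the bottleneck."""
    wait = 0.0
    it = iter(loader)
    start = time.perf_counter()
    for _ in range(steps):
        t0 = time.perf_counter()
        _batch = next(it)          # blocks until the next batch is ready
        wait += time.perf_counter() - t0
        # ... the actual training step (forward/backward) would run here ...
    total = time.perf_counter() - start
    return wait / total

# Toy stand-in for the real DataLoader: a generator with simulated IO latency.
def fake_loader():
    while True:
        time.sleep(0.001)  # pretend disk/tokenization takes 1 ms per batch
        yield [0] * 256

frac = profile_loader(fake_loader())
print(f"fraction of time spent waiting on data: {frac:.2f}")
```

With a real model, if this fraction stays high at batch size 384 but not at 256, that would point to data loading (workers, disk, or tokenization) rather than the GPU compute itself.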