Hi everyone,
I am training a GPT-2 model using Andrej Karpathy’s llm.c repository (train_gpt2.py) on a single NVIDIA A40 GPU (48 GB of memory). I’m trying to maximize GPU utilization to get the fastest possible training speed.
I’ve run into a situation that I can’t quite understand. I have tried two different configurations that should both result in an effective batch size of 512 (batch size × gradient accumulation steps):
- Scenario 1: Batch Size = 64, Gradient Accumulation Steps = 8
- Scenario 2: Batch Size = 16, Gradient Accumulation Steps = 32 (minimal sketch of both setups below)
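To spell out what I mean by "effective batch size", here is a minimal gradient-accumulation sketch. This is plain PyTorch with a dummy model, not the actual loop from train_gpt2.py; `train_step`, the shapes, and the optimizer settings are placeholders I made up for illustration:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Tiny stand-in model and random data: the point is only the accumulation
# arithmetic, not the real GPT-2 from train_gpt2.py.
model = nn.Linear(128, 128).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(micro_batch, accum_steps):
    """One optimizer step over micro_batch * accum_steps samples."""
    optimizer.zero_grad(set_to_none=True)
    for _ in range(accum_steps):
        x = torch.randn(micro_batch, 128, device=device)  # placeholder batch
        loss = model(x).square().mean() / accum_steps     # scale so summed grads match one big batch
        loss.backward()                                    # gradients accumulate across micro-steps
    optimizer.step()

train_step(micro_batch=64, accum_steps=8)    # Scenario 1: 64 * 8  = 512 samples per optimizer step
train_step(micro_batch=16, accum_steps=32)   # Scenario 2: 16 * 32 = 512 samples per optimizer step
```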
The issue is that both scenarios result in the same training speed (in tokens/second). I was expecting Scenario 1, with the larger per-step batch size, to be faster.
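For what it’s worth, here is how I’m reasoning about the arithmetic (the sequence length `T = 1024` is just an assumed value for illustration):

```python
T = 1024  # assumed sequence length, only for illustration

tokens_per_step_1 = 64 * 8 * T    # Scenario 1: 524,288 tokens per optimizer step
tokens_per_step_2 = 16 * 32 * T   # Scenario 2: 524,288 tokens per optimizer step
assert tokens_per_step_1 == tokens_per_step_2

# Both runs process the same number of tokens per optimizer step, so equal
# tokens/second implies each optimizer step takes the same wall-clock time.
```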
I would be grateful for any help with the following questions:
- Why might the training speed be the same, even though one setup uses a much larger batch size for each forward/backward pass?
- How can I investigate this further? Are there specific profilers or metrics I should be looking at to find the true bottleneck? (I’ve sketched the kind of measurement I have in mind below.)
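To make the second question concrete, this is roughly what I was thinking of trying: coarse step timing with CUDA events plus a kernel-level breakdown with torch.profiler. `train_step()` here is an empty placeholder standing in for one real forward/backward/optimizer step from the training loop:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def train_step():
    pass  # placeholder for one forward/backward/optimizer step

# 1) Coarse per-step timing with CUDA events (GPU work is async, so synchronize).
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
train_step()
end.record()
torch.cuda.synchronize()
print(f"step time: {start.elapsed_time(end):.1f} ms")

# 2) Kernel-level breakdown to see where the time actually goes.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_step()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

Is this the right kind of measurement, or is there a better way to tell whether I’m compute-bound, memory-bound, or bottlenecked somewhere else (e.g., the data loader)?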