GPT-2 Training Speed Unchanged with Different Batch Size & Grad Accumulation

Hi everyone,

I am training a GPT-2 model using Andrej Karpathy’s llm.c repository (train_gpt2.py) on a single NVIDIA A40 GPU with 46GB of memory. I’m trying to maximize my GPU utilization to get the fastest possible training speed.

I’ve run into a situation that I can’t quite understand. I have tried two different configurations that should both give an effective batch size of 512 (quick arithmetic sketch after the list):

  • Scenario 1: Batch Size = 64, Gradient Accumulation Steps = 8
  • Scenario 2: Batch Size = 16, Gradient Accumulation Steps = 32
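
For reference, this is how I’m computing the tokens per optimizer step in each scenario. It’s a generic sketch, not code from llm.c, and I’m assuming the default GPT-2 sequence length of 1024 tokens:

```python
# Quick sanity check: tokens processed per optimizer step in each scenario.
seq_len = 1024  # assuming the default GPT-2 context length

for batch_size, grad_accum in [(64, 8), (16, 32)]:
    tokens_per_micro_batch = batch_size * seq_len
    tokens_per_opt_step = tokens_per_micro_batch * grad_accum
    print(f"B={batch_size:2d}, accum={grad_accum:2d} -> "
          f"{tokens_per_opt_step:,} tokens per optimizer step")
# Both cases print 524,288 tokens, i.e. the same effective batch of 512 sequences.
```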

The issue is that both scenarios result in the same training speed (in tokens/second). I was expecting Scenario 1, with the larger per-step batch size, to be faster.

I would be grateful for any help with the following questions:

  1. Why might the training speed be the same, even though one setup uses a much larger batch size for each forward/backward pass?
  2. How can I investigate this further? Are there specific profilers or metrics I should look at to find the true bottleneck? (A rough sketch of how I’d measure things is below.)
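
For question 2, this is roughly what I was planning to try: timing a few optimizer steps to get tokens/second, then one profiled step with torch.profiler. The names here (model, optimizer, get_batch, B, T) are placeholders for whatever the script actually defines, not llm.c’s real variables:

```python
import time
import torch
from torch.profiler import profile, ProfilerActivity

def tokens_per_second(model, optimizer, get_batch, B, T, grad_accum, steps=20):
    """Time `steps` optimizer steps and return throughput in tokens/sec.
    `get_batch(B, T)` stands in for however the training script fetches data."""
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(steps):
        optimizer.zero_grad(set_to_none=True)
        for _ in range(grad_accum):
            x, y = get_batch(B, T)          # (B, T) token tensors on the GPU
            _, loss = model(x, y)           # GPT-2-style forward returning (logits, loss)
            (loss / grad_accum).backward()  # average gradients over micro-batches
        optimizer.step()
    torch.cuda.synchronize()                # count queued GPU work before stopping the clock
    return steps * grad_accum * B * T / (time.time() - t0)

def profile_one_step(model, optimizer, get_batch, B, T, grad_accum):
    """Profile a single optimizer step to see where the time goes (CPU vs CUDA kernels)."""
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        optimizer.zero_grad(set_to_none=True)
        for _ in range(grad_accum):
            x, y = get_batch(B, T)
            _, loss = model(x, y)
            (loss / grad_accum).backward()
        optimizer.step()
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

Is this the right kind of measurement, or is there a better way to pin down whether I’m compute-bound, memory-bound, or stalled on the data loader?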

This might help.