Hi everyone,
I am training a GPT-2 model using Andrej Karpathy’s llm.c repository (train_gpt2.py) on a single NVIDIA A40 GPU (48 GB of memory). I’m trying to maximize GPU utilization to get the fastest possible training speed.
I’ve run into a situation that I can’t quite understand. I have tried two different configurations that should both result in an effective batch size of 512 (batch size × gradient accumulation steps):
- Scenario 1: Batch Size = 64, Gradient Accumulation Steps = 8
- Scenario 2: Batch Size = 16, Gradient Accumulation Steps = 32 (minimal sketch of both setups below)
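To spell out what I mean by "effective batch size", here is a minimal gradient-accumulation sketch. This is plain PyTorch with a dummy model, not the actual loop from train_gpt2.py; `train_step`, the shapes, and the optimizer settings are placeholders I made up for illustration:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Tiny stand-in model and random data: the point is only the accumulation
# arithmetic, not the real GPT-2 from train_gpt2.py.
model = nn.Linear(128, 128).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(micro_batch, accum_steps):
    """One optimizer step over micro_batch * accum_steps samples."""
    optimizer.zero_grad(set_to_none=True)
    for _ in range(accum_steps):
        x = torch.randn(micro_batch, 128, device=device)  # placeholder batch
        loss = model(x).square().mean() / accum_steps     # scale so summed grads match one big batch
        loss.backward()                                    # gradients accumulate across micro-steps
    optimizer.step()

train_step(micro_batch=64, accum_steps=8)    # Scenario 1: 64 * 8  = 512 samples per optimizer step
train_step(micro_batch=16, accum_steps=32)   # Scenario 2: 16 * 32 = 512 samples per optimizer step
```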
The issue is that both scenarios result in the same training speed (in tokens/second). I was expecting Scenario 1, with the larger per-step batch size, to be faster.
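For what it’s worth, here is how I’m reasoning about the arithmetic (the sequence length `T = 1024` is just an assumed value for illustration):

```python
T = 1024  # assumed sequence length, only for illustration

tokens_per_step_1 = 64 * 8 * T    # Scenario 1: 524,288 tokens per optimizer step
tokens_per_step_2 = 16 * 32 * T   # Scenario 2: 524,288 tokens per optimizer step
assert tokens_per_step_1 == tokens_per_step_2

# Both runs process the same number of tokens per optimizer step, so equal
# tokens/second implies each optimizer step takes the same wall-clock time.
```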
I would be grateful for any help with the following questions:
- Why might the training speed be the same, even though one setup uses a much larger batch size for each forward/backward pass?
- How can I investigate this further? Are there specific profilers or metrics I should be looking at to find the true bottleneck? (I’ve sketched the kind of measurement I have in mind below.)
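To make the second question concrete, this is roughly what I was thinking of trying: coarse step timing with CUDA events plus a kernel-level breakdown with torch.profiler. `train_step()` here is an empty placeholder standing in for one real forward/backward/optimizer step from the training loop:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def train_step():
    pass  # placeholder for one forward/backward/optimizer step

# 1) Coarse per-step timing with CUDA events (GPU work is async, so synchronize).
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
train_step()
end.record()
torch.cuda.synchronize()
print(f"step time: {start.elapsed_time(end):.1f} ms")

# 2) Kernel-level breakdown to see where the time actually goes.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_step()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

Is this the right kind of measurement, or is there a better way to tell whether I’m compute-bound, memory-bound, or bottlenecked somewhere else (e.g., the data loader)?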