opened 03:26PM - 04 Oct 24 UTC
closed 07:33AM - 27 Nov 24 UTC
❓ question
⏳ needs more info
I expected a training configuration with per_device_train_batch_size=1 and gradient_accumulation_steps=32 to yield the same (or similar) result as per_device_train_batch_size=32 and gradient_accumulation_steps=1, but that's not the case: the former is much worse.
I ran several experiments with SmolLM-135M and Llama 3.2 1B, always using the same seed, and the results are consistent with this observation.
[Image: learning curves comparing the two configurations]

Maybe I misunderstand something here?
My training code is in [this Colab notebook](https://colab.research.google.com/drive/17g7zVSvGragLiB6DlHGW-D3X0TyfxKHg?usp=sharing). I ran this notebook to draw the learning curves above, restarting the notebook between each training to avoid OOM.
Note that I have the same observations with Qwen2.
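For context, my expectation of equivalence comes from the usual identity: averaging gradients over one full batch of 32 equals summing 32 accumulated micro-batch gradients, each scaled by 1/gradient_accumulation_steps. A minimal sketch of that arithmetic, using made-up per-example gradient values (not values from my runs):

```python
# Hypothetical per-example gradients for a batch of 32 examples.
grads = [float(i) for i in range(32)]

# Full batch (per_device_train_batch_size=32, accumulation=1):
# a single mean over all 32 examples.
full_batch_grad = sum(grads) / len(grads)

# Accumulation (per_device_train_batch_size=1, accumulation=32):
# each micro-batch contributes its mean, scaled by 1/accumulation_steps.
accumulation_steps = 32
accumulated_grad = 0.0
for g in grads:
    micro_batch_mean = g  # mean over a micro-batch of size 1 is the value itself
    accumulated_grad += micro_batch_mean / accumulation_steps

# The two should match up to floating-point error.
assert abs(full_batch_grad - accumulated_grad) < 1e-9
```

So mathematically the update per optimizer step should be (nearly) identical between the two configurations, which is why the gap in the curves surprised me.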