opened 03:26PM - 04 Oct 24 UTC
closed 07:33AM - 27 Nov 24 UTC
❓ question
⏳ needs more info
I expected a training configuration with per_device_train_batch_size=1 and gradient_accumulation_steps=32 to yield the same (or similar) result as per_device_train_batch_size=32 and gradient_accumulation_steps=1, but that's not the case: the former is much worse.
I ran several experiments with SmolLM-135M and Llama 3.2 1B, always using the same seed, and the results are consistent with this observation.
[Image: learning curves comparing the two configurations]

Maybe I misunderstand something here?
My training code is in [this Colab notebook](https://colab.research.google.com/drive/17g7zVSvGragLiB6DlHGW-D3X0TyfxKHg?usp=sharing). I ran this notebook to draw the learning curves above, restarting the notebook between each training to avoid OOM.
Note that I have the same observations with Qwen2.
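For context, my expectation of equivalence comes from the usual identity: averaging gradients over one full batch of 32 equals summing 32 accumulated micro-batch gradients, each scaled by 1/gradient_accumulation_steps. A minimal sketch of that arithmetic, using made-up per-example gradient values (not values from my runs):

```python
# Hypothetical per-example gradients for a batch of 32 examples.
grads = [float(i) for i in range(32)]

# Full batch (per_device_train_batch_size=32, accumulation=1):
# a single mean over all 32 examples.
full_batch_grad = sum(grads) / len(grads)

# Accumulation (per_device_train_batch_size=1, accumulation=32):
# each micro-batch contributes its mean, scaled by 1/accumulation_steps.
accumulation_steps = 32
accumulated_grad = 0.0
for g in grads:
    micro_batch_mean = g  # mean over a micro-batch of size 1 is the value itself
    accumulated_grad += micro_batch_mean / accumulation_steps

# The two should match up to floating-point error.
assert abs(full_batch_grad - accumulated_grad) < 1e-9
```

So mathematically the update per optimizer step should be (nearly) identical between the two configurations, which is why the gap in the curves surprised me.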