I’m fine-tuning a Stable Diffusion model. I trained it with batch size 24 for 100k steps on a single A100 GPU, and now I want to scale training across multiple GPUs. Say I move to 4 GPUs: the straightforward change is to keep batch size 24 per GPU, which gives an effective batch size of 96.
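For concreteness, this is roughly the 4-GPU data-parallel setup I have in mind (a minimal sketch launched with `torchrun`; the dataset and model here are dummy stand-ins for my actual fine-tuning script):

```python
# Minimal 4-GPU data-parallel sketch, launched with:
#   torchrun --nproc_per_node=4 train.py
# The dataset and model below are placeholders, not my real SD code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

per_gpu_batch_size = 24                                             # same as the single-A100 run
effective_batch_size = per_gpu_batch_size * dist.get_world_size()   # 24 * 4 = 96

dataset = TensorDataset(torch.randn(1024, 4, 64, 64))    # placeholder latents
sampler = DistributedSampler(dataset)                    # each rank sees a distinct shard
loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)

model = DDP(torch.nn.Conv2d(4, 4, 3, padding=1).cuda(), device_ids=[local_rank])
```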
I want to resume training from the 100k checkpoint, but I’m not sure whether the saved optimizer state (AdamW) implicitly assumes the batch size stays the same.
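For reference, this is roughly how I resume (the checkpoint filename and dict keys are just what my script happens to use, nothing standard):

```python
# Rough sketch of resuming from the 100k checkpoint; "checkpoint_100k.pt"
# and the keys "model" / "optimizer" / "step" are specific to my script.
import torch

ckpt = torch.load("checkpoint_100k.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])            # `model` is my fine-tuned UNet

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
optimizer.load_state_dict(ckpt["optimizer"])    # restores per-parameter exp_avg / exp_avg_sq

global_step = ckpt.get("step", 100_000)
```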
Would increasing the effective batch size mid-training like this break the optimizer, or make it converge slower than an optimizer with freshly initialized state?
The reason I think it could slow things down is that the optimizer’s state reflects how noisy the gradients used to be, so it effectively keeps taking cautious steps. If I switch to higher-quality (less noisy) gradients, it will still “remember” the old noise level and won’t apply stronger momentum, so the better gradients are partly wasted because they aren’t fully taken advantage of.
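To make that worry concrete, here is the mental model I have of AdamW’s state (just the standard moment updates with PyTorch’s default betas, nothing specific to my run):

```python
# AdamW keeps two exponential moving averages per parameter, and that's all
# it "remembers" about past gradient noise:
#   m_t = beta1 * m_{t-1} + (1 - beta1) * g_t       (first moment / momentum)
#   v_t = beta2 * v_{t-1} + (1 - beta2) * g_t**2    (second moment, scales the step)
# Old contributions decay geometrically, so a rough estimate of how long the
# pre-resume statistics linger:
import math

beta1, beta2 = 0.9, 0.999   # PyTorch AdamW defaults

def steps_until_weight(beta, remaining=0.01):
    """Steps until the old state's weight decays to `remaining` (here 1%)."""
    return math.log(remaining) / math.log(beta)

print(round(steps_until_weight(beta1)))   # ~44 steps for the momentum term
print(round(steps_until_weight(beta2)))   # ~4600 steps for the second moment
```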
How fast do you think the optimizer would learn to take advantage of the higher-quality gradients?
Is it roughly 1k steps, 10k steps, or never?