Is it a bad idea to increase batch size during training?

I’m fine-tuning a Stable Diffusion model. I trained it with batch size 24 for 100k steps on an A100 GPU. Now I want to scale it across many GPUs. Say I want to train it on 4 GPUs; the straightforward change is to keep batch size 24 per GPU, which gives an effective batch size of 96.
I want to resume training from the 100k checkpoint, but I’m not sure whether the saved optimizer state (AdamW) assumes an unchanging batch size.
Would increasing the effective batch size mid-training like this break the optimizer, or make it converge more slowly than a freshly initialized optimizer would?
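
For reference, the multi-GPU setup I have in mind looks roughly like this (a PyTorch DDP sketch; `build_model` and `train_dataset` are stand-ins for my actual fine-tuning code):

```python
# Launch with: torchrun --nproc_per_node=4 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda()                 # placeholder for the SD model I'm fine-tuning
model = DDP(model, device_ids=[local_rank])

sampler = DistributedSampler(train_dataset)  # placeholder dataset
loader = DataLoader(train_dataset, batch_size=24, sampler=sampler)

# 24 per GPU x 4 GPUs = effective batch size 96
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```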

The reason I think it could slow down the optimizer is that its accumulated statistics reflect how noisy the gradients used to be, so it applies correspondingly cautious updates. If I switch to higher-quality gradients, it will still “remember” the old noise level and won’t immediately take bigger steps, so some of the extra compute spent on the cleaner gradients would be wasted.
How fast do you think the optimizer would learn to take advantage of the high quality gradients?
Is it roughly 1k steps, 10k steps, or never?
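
To make the question concrete, here’s the back-of-envelope I’m working from (my own estimate, assuming the usual AdamW defaults β₁ = 0.9 and β₂ = 0.999):

```python
# AdamW's moments are exponential moving averages:
#   m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
#   v_t = beta2 * v_{t-1} + (1 - beta2) * g_t**2
# so after k steps with the new batch size, the pre-switch state only
# contributes a factor of beta**k.

beta1, beta2 = 0.9, 0.999  # typical AdamW defaults; check your config

for k in (100, 1_000, 3_000, 10_000):
    print(f"after {k:>6} steps: old m weight = {beta1**k:.2e}, "
          f"old v weight = {beta2**k:.3f}")

# With beta2 = 0.999 the second moment has an effective window of about
# 1/(1 - beta2) = 1000 steps, so the stale statistics should mostly wash
# out after a few thousand steps -- far less than the 100k already done.
```

If that arithmetic is right, the answer would be closer to ~1k steps than to 10k or never, but I’d like a sanity check on the reasoning.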

This is an interesting question! I do not think increasing the batch size while reusing the optimizer states from the previous training run would be an issue. You could either keep the old optimizer states or start with fresh gradient statistics; both would work, imo.
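
To be clear about what the two options look like in code, something like this (just a sketch; the file name and checkpoint keys are placeholders for however you saved things, and the `Linear` is a stand-in for your SD model):

```python
import torch
from torch.optim import AdamW

model = torch.nn.Linear(4, 4)  # stand-in for your fine-tuned SD model

ckpt = torch.load("checkpoint_100k.pt", map_location="cpu")  # placeholder path/keys
model.load_state_dict(ckpt["model"])

optimizer = AdamW(model.parameters(), lr=1e-5)  # lr just illustrative

# Option A: resume AdamW's moment estimates from the 100k run
optimizer.load_state_dict(ckpt["optimizer"])

# Option B: fresh statistics -- simply skip the load_state_dict call above.
# Nothing in the saved optimizer state encodes a batch size; the new
# effective batch of 96 only changes the gradients it sees from here on.
```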

Worst case, if you’re limited on compute, you could start with a small learning rate for a bit and then increase it once you’re confident the gradient statistics have caught up. I haven’t seen this as a formal recommendation anywhere; it’s more of a “might as well” precaution :slight_smile:
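
If you do want that safety margin, a manual ramp over the first ~1k steps after resuming is enough. Rough sketch (the numbers are made up, and `loader` / `training_step` are placeholders for your own loop):

```python
# Linear LR warmup after resuming at step 100k with the larger batch.
# base_lr and warmup_steps are made-up numbers, not a recommendation.
base_lr = 1e-5
warmup_steps = 1_000

for step, batch in enumerate(loader, start=100_000):   # loader: your DataLoader
    warm = min(1.0, (step - 100_000 + 1) / warmup_steps)
    for group in optimizer.param_groups:
        group["lr"] = base_lr * warm

    loss = training_step(model, batch)  # placeholder for your loss computation
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```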


A tweet on the topic: https://twitter.com/Yampeleg/status/1674034884652802048
