I’m fine-tuning a Stable Diffusion model. I trained it with batch size 24 for 100k steps on a single A100 GPU, and now I want to scale training across multiple GPUs. Say I move to 4 GPUs: the straightforward change is to keep batch size 24 per GPU, which gives an effective batch size of 96.
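For concreteness, this is roughly the 4-GPU data-parallel setup I have in mind (a minimal sketch launched with `torchrun`; the dataset and model here are dummy stand-ins for my actual fine-tuning script):

```python
# Minimal 4-GPU data-parallel sketch, launched with:
#   torchrun --nproc_per_node=4 train.py
# The dataset and model below are placeholders, not my real SD code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

per_gpu_batch_size = 24                                             # same as the single-A100 run
effective_batch_size = per_gpu_batch_size * dist.get_world_size()   # 24 * 4 = 96

dataset = TensorDataset(torch.randn(1024, 4, 64, 64))    # placeholder latents
sampler = DistributedSampler(dataset)                    # each rank sees a distinct shard
loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)

model = DDP(torch.nn.Conv2d(4, 4, 3, padding=1).cuda(), device_ids=[local_rank])
```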
I want to resume training from the 100k checkpoint, but I’m not sure whether the saved optimizer state (AdamW) implicitly assumes the batch size stays the same.
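For reference, this is roughly how I resume (the checkpoint filename and dict keys are just what my script happens to use, nothing standard):

```python
# Rough sketch of resuming from the 100k checkpoint; "checkpoint_100k.pt"
# and the keys "model" / "optimizer" / "step" are specific to my script.
import torch

ckpt = torch.load("checkpoint_100k.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])            # `model` is my fine-tuned UNet

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
optimizer.load_state_dict(ckpt["optimizer"])    # restores per-parameter exp_avg / exp_avg_sq

global_step = ckpt.get("step", 100_000)
```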
Would increasing the effective batch size mid-training like this break the optimizer, or make it converge slower than an optimizer with freshly initialized state?
The reason I think it could slow things down is that the optimizer’s state reflects how noisy the gradients used to be, so it effectively keeps taking cautious steps. If I switch to higher-quality (less noisy) gradients, it will still “remember” the old noise level and won’t apply stronger momentum, so the better gradients are partly wasted because they aren’t fully taken advantage of.
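To make that worry concrete, here is the mental model I have of AdamW’s state (just the standard moment updates with PyTorch’s default betas, nothing specific to my run):

```python
# AdamW keeps two exponential moving averages per parameter, and that's all
# it "remembers" about past gradient noise:
#   m_t = beta1 * m_{t-1} + (1 - beta1) * g_t       (first moment / momentum)
#   v_t = beta2 * v_{t-1} + (1 - beta2) * g_t**2    (second moment, scales the step)
# Old contributions decay geometrically, so a rough estimate of how long the
# pre-resume statistics linger:
import math

beta1, beta2 = 0.9, 0.999   # PyTorch AdamW defaults

def steps_until_weight(beta, remaining=0.01):
    """Steps until the old state's weight decays to `remaining` (here 1%)."""
    return math.log(remaining) / math.log(beta)

print(round(steps_until_weight(beta1)))   # ~44 steps for the momentum term
print(round(steps_until_weight(beta2)))   # ~4600 steps for the second moment
```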
How fast do you think the optimizer would learn to take advantage of the higher-quality gradients?
Is it roughly 1k steps, 10k steps, or never?