I’ve been trying to reproduce the CompVis/lambdalabs pokemon fine-tuning results with the Diffusers fine-tuning training script. The results are drastically different: with the same hyperparameters (LR, batch size, gradient accumulation, etc.), the Diffusers script overfits rapidly within the first few iterations, while the outputs of the CompVis script change much more gradually (as expected).
Example prompt: “a red mouse”
CompVis
LR = 1e-4, batch_size = 2, grad_accum = 2, ~800 steps (2 epochs):
Diffusers
LR = 1e-4, batch_size = 2, grad_accum = 2, ~500 steps (3 epochs):

(There’s an additional, odd difference in how epochs are counted. With 833 training images, a CompVis epoch is 416 steps (dataset size / batch size), whereas the Diffusers script divides by both batch size and gradient accumulation (2 × 2 = 4), so an epoch is 833 / 4 ≈ 208 steps. This shouldn’t really matter, though; the results should still be similar. See the arithmetic below.)
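To make the epoch-counting difference concrete, here is the arithmetic in plain Python, using the numbers from above:

```python
num_images = 833
batch_size = 2
grad_accum = 2

# CompVis convention: an epoch is dataset size / batch size
compvis_steps_per_epoch = num_images // batch_size  # 416

# Diffusers convention: gradient accumulation also divides the step count
diffusers_steps_per_epoch = num_images // (batch_size * grad_accum)  # 208

print(compvis_steps_per_epoch, diffusers_steps_per_epoch)  # 416 208
```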
I’m also finding that the Diffusers script produces fast changes in the output space even at LR = 1e-6, compared to the CompVis script over a similar number of iterations. What’s going on here?
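For anyone trying to reproduce this, one sanity check worth running is logging the optimizer’s effective learning rate each step, since a silent LR rescale (e.g. by batch size × gradient accumulation × number of GPUs) would produce exactly this kind of gap. This is a minimal generic PyTorch sketch, not code from either script; the model is just a stand-in:

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the actual UNet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# The param-group LR is what the optimizer actually applies, so if a
# training script rescales the configured LR before building the
# optimizer, this print will reveal it.
for step in range(3):
    for group in optimizer.param_groups:
        print(f"step {step}: effective lr = {group['lr']:.2e}")
    # ... training step would go here ...
```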