Discrepancies between CompVis and Diffusers fine-tuning?

I’ve been trying to reproduce the CompVis/lambdalabs pokemon fine-tuning results with the Diffusers fine-tuning script. I’m finding that the results are drastically different: with the same hyperparameters (LR, batch size, gradient accumulation, etc.), the Diffusers script overfits rapidly in the first few iterations, while the outputs of the CompVis script change much more gradually (as expected).

Example: prompt: “a red mouse”
LR = 1e-4, batch_size = 2, grad_accum = 2, ~800 steps (2 epochs):

LR = 1e-4, batch_size = 2, grad_accum = 2, ~500 steps (3 epochs):

(There’s an additional, odd difference in how epochs are calculated. With 833 train images, a CompVis epoch is 416 steps, while the Diffusers script counts both batch size and gradient accumulation toward a step (2 × 2 → 4), so an epoch is 833 / 4 = 208 steps. This shouldn’t really matter, though; the results should still be similar.)
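For reference, the step counting above works out like this (a quick sketch; 833 images, batch size 2, and grad accum 2 are the numbers from my runs):

```python
# Steps per "epoch" under the two conventions described above.
num_images = 833
batch_size = 2
grad_accum = 2

# CompVis-style: one step per batch (dropping the last partial batch)
compvis_steps_per_epoch = num_images // batch_size  # 416

# Diffusers-style: one step per optimizer update, which consumes
# batch_size * grad_accum images
diffusers_steps_per_epoch = num_images // (batch_size * grad_accum)  # 208

print(compvis_steps_per_epoch, diffusers_steps_per_epoch)
```

So a "2 epoch" run means roughly 832 CompVis steps but only 416 Diffusers steps; either way, the same number of images is seen per epoch, which is why this alone shouldn’t explain the difference.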

I’m also finding that the Diffusers script results in fast changes in the output space even at LR = 1e-6 when compared to the CompVis script for similar iterations. What’s going on here?
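One knob worth double-checking (an assumption on my part; I haven’t confirmed it’s set in your runs): the Diffusers script has a `--scale_lr` flag that multiplies the base learning rate by batch size, gradient accumulation steps, and number of processes, so the LR the optimizer actually sees can be several times larger than the one you pass. A minimal sketch of that scaling (names here are illustrative, not the script’s exact ones):

```python
def effective_lr(base_lr: float, batch_size: int, grad_accum: int,
                 num_processes: int, scale_lr: bool) -> float:
    """Mimic the LR scaling Diffusers applies when --scale_lr is passed."""
    if scale_lr:
        return base_lr * batch_size * grad_accum * num_processes
    return base_lr

# With LR = 1e-4, batch_size = 2, grad_accum = 2 on a single GPU:
print(effective_lr(1e-4, 2, 2, 1, scale_lr=True))   # 4x the base LR
print(effective_lr(1e-4, 2, 2, 1, scale_lr=False))  # unchanged
```

If `--scale_lr` is on in one script and there’s no equivalent scaling in the other, that alone would make the Diffusers run move through output space noticeably faster at a nominally identical LR.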


Couldn’t add another link to my original post, but here’s a link to the lambdalabs fine-tuning repo/instructions:

I’ve been getting the exact same error—have you tried checking the dependency versions?

I’m also running into the same error, but I don’t think this is a dependency problem. If anyone figures this out, I’m following this thread/it would be greatly appreciated!


@alexjwang It’s definitely a dependency problem—I’m 200% sure of it.

Hi everybody, sorry we’re not yet using the forum extensively to answer questions and prefer GitHub, since we only have a few maintainers at the moment. It would be amazing if you could follow the answers here: Why does train_text_to_image.py perform so differently from the CompVis script? · Issue #1153 · huggingface/diffusers · GitHub