I’ve been trying to reproduce the CompVis/lambdalabs pokemon fine-tuning results with the Diffusers fine-tuning training script. The results are drastically different: with the same hyperparameters (LR, batch size, gradient accumulation, etc.), the Diffusers script overfits rapidly within the first few iterations, while the outputs of the CompVis script change much more gradually (as expected).
Example prompt: “a red mouse”
CompVis
LR = 1e-4, batch_size = 2, grad_accum = 2, ~800 steps (2 epochs):
Diffusers
LR = 1e-4, batch_size = 2, grad_accum = 2, ~500 steps (3 epochs):

(There’s an additional, odd difference in how epochs are counted. With 833 training images, a CompVis epoch is 416 steps (dataset size / batch size), whereas the Diffusers script divides by both batch size and gradient accumulation (2 × 2 = 4), so an epoch is 833 / 4 ≈ 208 steps. This shouldn’t really matter, though; the results should still be similar. See the arithmetic below.)
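To make the epoch-counting difference concrete, here is the arithmetic in plain Python, using the numbers from above:

```python
num_images = 833
batch_size = 2
grad_accum = 2

# CompVis convention: an epoch is dataset size / batch size
compvis_steps_per_epoch = num_images // batch_size  # 416

# Diffusers convention: gradient accumulation also divides the step count
diffusers_steps_per_epoch = num_images // (batch_size * grad_accum)  # 208

print(compvis_steps_per_epoch, diffusers_steps_per_epoch)  # 416 208
```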
I’m also finding that the Diffusers script produces fast changes in the output space even at LR = 1e-6, compared to the CompVis script over a similar number of iterations. What’s going on here?
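For anyone trying to reproduce this, one sanity check worth running is logging the optimizer’s effective learning rate each step, since a silent LR rescale (e.g. by batch size × gradient accumulation × number of GPUs) would produce exactly this kind of gap. This is a minimal generic PyTorch sketch, not code from either script; the model is just a stand-in:

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the actual UNet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# The param-group LR is what the optimizer actually applies, so if a
# training script rescales the configured LR before building the
# optimizer, this print will reveal it.
for step in range(3):
    for group in optimizer.param_groups:
        print(f"step {step}: effective lr = {group['lr']:.2e}")
    # ... training step would go here ...
```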