Notable differences between other implementations of stable diffusion, particularly in the img2img pipeline

Hey there!

I’ve been running extensive comparisons between diffusers’ Stable Diffusion and the AUTOMATIC1111 and NMKD-SD-GUI implementations (both of which wrap the CompVis/stable-diffusion repo). I wanted to report some observations and ask whether the community can shed some light on them.

For DDIM, I get different output from the same configuration (20 steps, 7.5 CFG, seed 0). (Images to follow in later posts.)
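One thing that may be worth ruling out for the DDIM mismatch is whether both implementations end up visiting the same timesteps. The sketch below is only an illustration, assuming 1000 training timesteps and uniform spacing; `ddim_timesteps` and its `offset` argument are hypothetical names, not the API of diffusers or CompVis. It shows how a one-step offset shifts every timestep the sampler sees, which could be enough to change the output for a fixed seed.

```python
def ddim_timesteps(num_inference_steps, num_train_timesteps=1000, offset=0):
    """Uniformly spaced timesteps, highest first (hypothetical helper)."""
    # Assumption: the sampler picks every `stride`-th training timestep.
    stride = num_train_timesteps // num_inference_steps
    steps = [i * stride + offset for i in range(num_inference_steps)]
    return steps[::-1]


no_offset = ddim_timesteps(20)
with_offset = ddim_timesteps(20, offset=1)

print(no_offset[:3], no_offset[-1])      # [950, 900, 850] 0
print(with_offset[:3], with_offset[-1])  # [951, 901, 851] 1
print(no_offset == with_offset)          # False
```

Printing the timestep list from each implementation and comparing it against something like this would show quickly whether the schedules line up.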

For LMS, the txt2img output is exactly the same (20 steps, 7.5 CFG, seed 0), but in img2img the output is very different. Most notably, diffusers seems to apply some kind of smoothing that makes the output lose crispness. (Images to follow in later posts.)
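Since txt2img matches and img2img does not, the img2img-specific settings are a natural place to look. The strength setting decides how many of the scheduled steps actually run on the noised input image, so it is one spot where two implementations can diverge. A minimal sketch, assuming the common convention that `int(steps * strength)` steps are denoised; `img2img_steps` is a hypothetical helper, and neither library is guaranteed to use exactly this rule.

```python
def img2img_steps(num_inference_steps, strength):
    """Return (start_index, steps_run) for an img2img pass (hypothetical helper).

    Assumption: int(num_inference_steps * strength) steps are denoised,
    starting part-way through the schedule.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    start_index = num_inference_steps - steps_run
    return start_index, steps_run


print(img2img_steps(20, 0.5))   # (10, 10)
print(img2img_steps(20, 0.75))  # (5, 15)
```

With 20 steps and a strength of 0.5 (the img2img settings used in this thread), this convention runs 10 denoising steps. If one implementation counts steps differently, the outputs would not be expected to match even with the same seed.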

This leads me to believe that something differs in the parameters being fed to the algorithm, or in the architecture itself. I find it strange that LMS gives the same output for txt2img but different output for img2img, which makes me suspect a change in the VAE part of the model architecture in diffusers compared to CompVis’. I’ve noticed that there are quite noticeable differences between diffusers and the regular stable-diffusion inference config (stable-diffusion/v1-inference.yaml in CompVis/stable-diffusion on GitHub).

I would love it if someone more knowledgeable could shed some more light on this! Personally, for img2img, I find that Automatic’s implementation looks a lot crisper and more natural. It seems to break from the form of the original image a bit more.


Here are the image references. Due to new user status, I wasn’t able to add them all in a single post.

For DDIM, I get different output from the same configuration (20 steps, 7.5 CFG, seed 0), as shown below.

DIFFUSERS DDIM txt2img:

AUTOMATIC1111 DDIM txt2img:

For LMS, the txt2img output is exactly the same (20 steps, 7.5 CFG, seed 0), but in img2img the output is very different. Most notably, diffusers seems to apply some kind of smoothing that makes the output lose crispness.

DIFFUSERS LMS txt2img:

As we can see here, the output is exactly the same as Diffusers’, which is what I would expect for all the schedulers.

AUTOMATIC1111 DDIM txt2img:

For img2img, LMS does not give the same result (20 steps, 7.5 CFG, seed 0, 0.5 strength).

DIFFUSERS LMS img2img:

AUTOMATIC1111 LMS img2img:

Very different, and it seems to me that Automatic produces a more resolved image.
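To put a number on "crisper" or "more resolved" instead of judging by eye, one option is the variance of a Laplacian filter, a common rough sharpness score where a higher value usually means more fine detail. This is a minimal pure-Python sketch that works on a 2D list of grayscale values; `laplacian_variance` is a hypothetical helper, and loading the actual output images into that form is left out.

```python
def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over a 2D grid of grayscale values."""
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("need at least a 3x3 grid")
    responses = []
    # Border pixels are skipped so every pixel has four neighbours.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


# Toy inputs: a checkerboard (lots of edges) versus a flat grey patch.
sharp = [[0, 255, 0, 255], [255, 0, 255, 0], [0, 255, 0, 255], [255, 0, 255, 0]]
flat = [[128] * 4 for _ in range(4)]

print(laplacian_variance(sharp) > laplacian_variance(flat))  # True
```

Running the same score on the two img2img outputs would give a rough, repeatable comparison, though it says nothing about which image is closer to the intended result.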

Sorry, there was a typo here: it should be “AUTOMATIC1111 LMS txt2img”.


I have the same problem: I’m using the DDIM sampler with the Stable Diffusion CompVis repo and getting different results than when loading the model with the diffusers library.
I went over both model configs and couldn’t find any difference. I also made sure the same model ckpt is used.
Would really appreciate some help on that!


Same issue here: I get different outputs from the diffusers and CompVis models under the same config.