I’ve been running some extensive tests comparing the diffusers implementation of Stable Diffusion against AUTOMATIC1111’s and NMKD-SD-GUI’s implementations (both of which wrap the CompVis/stable-diffusion repo). I wanted to report some observations and see whether the community can shed some light on the findings.
For DDIM, using the same configuration (20 steps, 7.5 CFG, seed 0), I get different output between the implementations. (Images to follow in posts)
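One sanity check worth doing before blaming the sampler (a suggestion on my part, not something either codebase documents): verify that the *initial* noise latent is even the same across implementations. A seeded CPU `torch.Generator` is deterministic across runs, but a GPU generator with the same seed draws different values, so implementations that seed on different devices will diverge from step zero. A minimal sketch:

```python
import torch

# Hypothetical helper: draw the initial SD latent (4 channels, 64x64 for a
# 512x512 image) from a seeded CPU generator so it is reproducible.
def initial_latent(seed: int, shape=(1, 4, 64, 64)):
    gen = torch.Generator(device="cpu").manual_seed(seed)
    return torch.randn(shape, generator=gen)

a = initial_latent(0)
b = initial_latent(0)
# Same seed, same device -> bit-identical latents. If two implementations
# disagree already at this tensor, the samplers never had a chance to match.
assert torch.equal(a, b)
```

If the latents match but outputs still differ, the discrepancy is somewhere in the scheduler math or the model config rather than in the seeding.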
For LMS, the txt2img output is exactly the same (20 steps, 7.5 CFG, seed 0), but when moving to img2img the output is very different; most notably, there seems to be some smoothing happening in diffusers that causes the output to lose crispness. (Images to follow in posts)
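One place img2img implementations commonly diverge even when txt2img matches (my reading of the general diffusers-style approach, not a line-by-line copy of either codebase): how the denoising `strength` parameter truncates the step schedule. If one implementation runs a different number of actual denoising steps, or starts from a differently-noised encoding of the input image, the outputs will differ only in img2img:

```python
# Sketch of the strength -> steps mapping used by diffusers-style img2img:
# the input image is noised to an intermediate timestep, and only the tail
# of the schedule is denoised.
def img2img_steps(num_inference_steps: int, strength: float) -> int:
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start  # steps actually executed

print(img2img_steps(20, 0.75))  # -> 15: only 15 of the 20 steps run
print(img2img_steps(20, 1.0))   # -> 20: full denoise from pure noise
```

So with "20 steps" and a strength of 0.75, only 15 denoising steps actually run; if the other implementation interprets the same settings differently, that alone could account for a softer or sharper result.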
Looking at this, it leads me to believe there is some underlying difference in the parameters being fed to the algorithm, or in the architecture of the algorithm itself. I find it strange that LMS gives the same output for txt2img but different output for img2img, which makes me suspect there is a change in the VAE part of the model architecture in diffusers relative to CompVis’. I’ve also noticed quite noticeable differences between the diffusers config and the regular stable-diffusion inference config: stable-diffusion/v1-inference.yaml at main · CompVis/stable-diffusion · GitHub
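One concrete config item to compare is the noise schedule. v1-inference.yaml specifies `linear_start: 0.00085` / `linear_end: 0.012`; in diffusers this corresponds, as far as I can tell, to the `scaled_linear` beta schedule, which interpolates linearly in sqrt space and then squares. A plain `linear` schedule with the same endpoints produces different betas everywhere except the endpoints, so a mismatch here between configs would quietly change every sampling step:

```python
import torch

# Endpoint values taken from v1-inference.yaml (linear_start / linear_end).
beta_start, beta_end, T = 0.00085, 0.012, 1000

# "scaled_linear": linear interpolation in sqrt space, then squared --
# what I believe diffusers uses for Stable Diffusion checkpoints.
scaled = torch.linspace(beta_start**0.5, beta_end**0.5, T) ** 2

# Plain "linear" schedule with the same endpoints, for comparison.
linear = torch.linspace(beta_start, beta_end, T)

# The endpoints agree but the interiors do not, so loading the wrong
# schedule name would still "work" while producing different images.
assert torch.isclose(scaled[0], linear[0]) and torch.isclose(scaled[-1], linear[-1])
assert not torch.allclose(scaled, linear)
```

Dumping and diffing the scheduler config on both sides would rule this in or out quickly.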
Would love it if someone more knowledgeable could shed some more light on this! Personally, for img2img I find that Automatic’s implementation looks a lot crisper and more natural. It seems to break from the form of the original image a bit more.