[Stable Diffusion] Error in "In Painting" pipeline

Hello!
Why is the masking in the “In Painting” pipeline done on the latents and not on the decoded VAE images?

```python
latents = (init_latents_proper * mask) + (latents * (1 - mask))
```

If this is correct, how is the mask mapped into the latent space? Do the pixels’ original locations (clustered) map to the expected positions in the latent space?

Shouldn’t the algorithm be the following? (A rough code sketch of this follows the two lists below.)

  1. Retrieve the latents.
  2. Decode the latents.
  3. Mix both images (original and decoded version) using the mask.
  4. Encode the image to obtain the new latents.

Instead of:

  1. Retrieve the latents.
  2. Mix the original latents (with noise added corresponding to the timestep) with the current latents.
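Something like this, as a rough sketch (assuming a diffusers-style `AutoencoderKL` called `vae`, the denoised `latents`, the original image tensor `init_image` in `[-1, 1]`, and a full-resolution `mask` where 1 means “keep the original pixel”; all names are illustrative):

```python
import torch

@torch.no_grad()
def pixel_space_blend(vae, latents, init_image, mask, scale=0.18215):
    """Blend in pixel space, then re-encode (steps 1-4 above).

    vae: a diffusers AutoencoderKL; latents: the denoised latents;
    init_image: original image tensor in [-1, 1]; mask: full-resolution
    mask with 1 = keep the original pixel. 0.18215 is SD 1.x's latent
    scaling factor.
    """
    # 1-2. Retrieve and decode the latents into an image.
    decoded = vae.decode(latents / scale).sample
    # 3. Mix the original and decoded images at full resolution.
    mixed = init_image * mask + decoded * (1 - mask)
    # 4. Re-encode the mixed image to obtain the new latents.
    return vae.encode(mixed).latent_dist.sample() * scale
```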

Hi @CristoJV! As the diffusion process for Stable Diffusion works exclusively with the VAE latents, the masks received by the inpainting pipeline are resampled from 512x512 to 64x64 to mask the latents.
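Roughly like this (a toy sketch; the pipeline’s exact resizing code may differ):

```python
import torch
import torch.nn.functional as F

# A 512x512 binary mask is resampled to the 64x64 latent resolution
# (512 / 8, with 8 being the VAE downscaling factor), then broadcast
# across the 4 latent channels.
mask = torch.zeros(1, 1, 512, 512)
mask[:, :, 128:384, 128:384] = 1.0  # hypothetical square inpainting region

latent_mask = F.interpolate(mask, size=(64, 64), mode="nearest")
latent_mask = latent_mask.repeat(1, 4, 1, 1)  # match the latent channels
print(latent_mask.shape)  # torch.Size([1, 4, 64, 64])
```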


Hi @anton-l! Thank you for replying.

Should the inpainting pipeline be upgraded with a further step that mixes the original image and the generated one in the image domain?

The issue is that, although the image latents are preserved after the masking, the VAE’s encode and decode steps are lossy. For example, faces that were not modified (not covered by the mask) come back slightly uglier or distorted. Maybe a further step that mixes both images using the mask after generation would improve the results without interfering with the pipeline’s operation.

Hi @CristoJV, I’m using SD 1.5, and in my code I added a post-processing step that mixes the untouched original image with the result decoded from the VAE, using the original (not downscaled) mask, and I get better results.
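Something along these lines (a rough sketch of that post-processing step; the paths, prompt, and checkpoint are illustrative, the three images are assumed to share the same size, and the mask uses white for the area to inpaint, as in diffusers):

```python
import numpy as np
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Illustrative inputs.
original = Image.open("original.png").convert("RGB")
mask_image = Image.open("mask.png").convert("L")  # white = area to inpaint

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"
)
result = pipe(prompt="a photo of a person", image=original,
              mask_image=mask_image).images[0]

# Post-process: keep the untouched original outside the mask,
# the SD result inside it.
mask = np.asarray(mask_image, dtype=np.float32)[..., None] / 255.0
blended = (np.asarray(result, dtype=np.float32) * mask
           + np.asarray(original, dtype=np.float32) * (1.0 - mask))
Image.fromarray(blended.astype(np.uint8)).save("blended.png")
```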

By the way, the output from the VAE also differs in saturation and brightness, so a slight difference between the inpainted area and the original image is noticeable.

I’m guessing that the VAE’s encode-decode round trip makes the image lose some of its original properties.

An idea I’ll definitely try is dilating the original mask a bit, in order to keep some border information from the decoded latent, and then blending the luminosity of the decoded result and the original image; with this trick I think I could achieve a better result.
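Roughly like this (a sketch with OpenCV; the kernel and blur sizes are guesses to tune per image):

```python
import cv2
import numpy as np

# Grow the mask a few pixels so the blend keeps some of the decoded
# border, softening the seam between inpainted and original regions.
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)
kernel = np.ones((9, 9), np.uint8)
dilated = cv2.dilate(mask, kernel, iterations=1)

# Optionally feather the edge so the transition is gradual.
soft = cv2.GaussianBlur(dilated, (15, 15), 0)
cv2.imwrite("mask_dilated.png", soft)
```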

By the way, I agree with you that the process should be upgraded to get better results.

For what it’s worth, to me the inpainting process with Diffusers is just not usable for processing images with people’s faces; the distortion is fairly aggressive.


Same problem here. Has anyone found a fix?
