[Stable Diffusion] Error in "In Painting" pipeline

Why does the “In Painting” pipeline apply the mask to the latents rather than to the images decoded by the VAE?

latents = (init_latents_proper * mask) + (latents * (1 - mask))

If this is correct, how is the mask mapped into the latent space? Do the original pixel locations (clustered) correspond to the expected positions in the latent space?

Shouldn't the algorithm be the following?

  1. Retrieve the latents.
  2. Decode the latents.
  3. Mix both images (original and decoded version) using the mask.
  4. Encode the image to obtain the new latents.

Instead of:

  1. Retrieve the latents.
  2. Mix the original latents (with noise added corresponding to the timestep) with the current latents.
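To make the second variant concrete, here is a minimal NumPy sketch of the latent-space mixing. It is an illustration under assumptions, not the pipeline's actual code: `add_noise` and `blend_latents` are hypothetical helpers, `alpha_bar_t` stands in for the scheduler's cumulative noise level at the current timestep, and the mask is assumed to be 1 where the original content should be kept.

```python
import numpy as np

def add_noise(clean_latents, noise, alpha_bar_t):
    # Standard forward-diffusion noising; alpha_bar_t is an assumed
    # stand-in for the scheduler's cumulative alpha at this timestep.
    return np.sqrt(alpha_bar_t) * clean_latents + np.sqrt(1.0 - alpha_bar_t) * noise

def blend_latents(init_latents_proper, latents, mask):
    # Same expression as the line quoted above: keep the (noised) original
    # where mask == 1, keep the denoised latents where mask == 0.
    return init_latents_proper * mask + latents * (1.0 - mask)

rng = np.random.default_rng(0)
orig_latents = rng.standard_normal((1, 4, 64, 64))  # encoded original image
latents = rng.standard_normal((1, 4, 64, 64))       # current denoising state

# Hypothetical mask: keep the left half, regenerate the right half.
mask = np.zeros((1, 1, 64, 64))
mask[..., :32] = 1.0

noise = rng.standard_normal(orig_latents.shape)
init_latents_proper = add_noise(orig_latents, noise, alpha_bar_t=0.5)
blended = blend_latents(init_latents_proper, latents, mask)
```

The single-channel mask broadcasts over the four latent channels, so every channel at a given spatial position is either kept or replaced together.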

Hi @CristoJV! Since the diffusion process in Stable Diffusion works exclusively on the VAE latents, the masks received by the inpainting pipeline are resampled from 512x512 to 64x64 so they can be applied to the latents.
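As a rough illustration of that resampling, the sketch below shrinks a 512x512 mask by a factor of 8 using nearest-neighbour subsampling. The helper `downsample_mask` is hypothetical and the choice of filter is an assumption; the resampling method the pipeline actually uses may differ. It also assumes that latent position (i, j) corresponds roughly to the 8x8 pixel block starting at (8i, 8j), which is only approximately true because the VAE's receptive field is larger than one block.

```python
import numpy as np

def downsample_mask(pixel_mask, factor=8):
    # Nearest-neighbour style subsampling: take one pixel per
    # factor x factor block (assumed filter, for illustration only).
    height, width = pixel_mask.shape
    if height % factor or width % factor:
        raise ValueError("mask size must be divisible by the factor")
    return pixel_mask[::factor, ::factor]

# Hypothetical 512x512 mask: left half kept (1), right half regenerated (0).
pixel_mask = np.zeros((512, 512))
pixel_mask[:, :256] = 1.0

latent_mask = downsample_mask(pixel_mask)  # shape (64, 64)

# Repeat over the 4 latent channels so it matches a (1, 4, 64, 64) latent.
latent_mask_4ch = np.broadcast_to(latent_mask[None, None], (1, 4, 64, 64))
```

With this layout the spatial arrangement of the mask is preserved: the left half of the pixel mask becomes the left half of the latent mask.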

Hi @anton-I! Thank you for replying.

Should the inpainting pipeline be upgraded with a further step that mixes the original image and the generated one, but in the image domain?

The issue is that, although the image latents are preserved after the masking, the VAE’s encoding and decoding are lossy. For example, faces that were not modified (not covered by the mask) come back looking slightly uglier or distorted. Perhaps a further step that mixes both images using the mask after generation would improve the results without interfering with the pipeline’s operation.
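A minimal sketch of the extra step I have in mind, run on the decoded output after the pipeline has finished. This is only a proposal, not something the pipeline does: `composite_in_pixel_space` is a hypothetical helper, images are assumed to be float arrays of shape (H, W, 3), and `keep_mask` is assumed to be 1 where the original pixels must be preserved.

```python
import numpy as np

def composite_in_pixel_space(original, generated, keep_mask):
    # Paste the untouched original pixels back over the decoded result,
    # so regions outside the inpainted area bypass the VAE round trip.
    keep = keep_mask[..., None].astype(original.dtype)  # (H, W, 1) for broadcasting
    return original * keep + generated * (1.0 - keep)

rng = np.random.default_rng(0)
original = rng.random((512, 512, 3))   # stand-in for the input image
generated = rng.random((512, 512, 3))  # stand-in for the decoded pipeline output

# Hypothetical mask: regenerate only the central square.
keep_mask = np.ones((512, 512))
keep_mask[128:384, 128:384] = 0.0

result = composite_in_pixel_space(original, generated, keep_mask)
```

A hard 0/1 mask can leave a visible seam at the boundary; blurring the mask slightly before compositing is one possible way to soften it.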