Img2img: how are training and inference different from text2img?

I would like to apply the ideas behind the StableDiffusionImg2ImgPipeline in a slightly different domain, but I don't fully understand the approach to image-conditioned diffusion.

Currently I am training a diffusion model to generate image-like results from partial image inputs (think "complete this image from partial information"). Right now the approach is to use a UNet2DConditionModel() with the partial image encoded and passed as encoder_hidden_states. I get decent results, but I believe this would perform better if the partial images were used as the latent starting point, as shown in the img2img code here and as described in the Stable Diffusion paper here, where the conditioning can be concatenated to the latent vector (or replace it?) as the starting noise during inference.
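For concreteness, here is a simplified sketch of my current setup; the encoder output, sequence length, and dimensions are placeholders for what I actually use:

```python
import torch
from diffusers import UNet2DConditionModel

# Simplified version of my current approach: the partial image is encoded into
# a sequence of embeddings and fed to the UNet through cross-attention.
unet = UNet2DConditionModel(
    sample_size=64,
    in_channels=4,
    out_channels=4,
    cross_attention_dim=768,
)

noisy_latents = torch.randn(1, 4, 64, 64)      # x_t
timesteps = torch.tensor([500])
partial_image_emb = torch.randn(1, 77, 768)    # placeholder for my encoded partial image

noise_pred = unet(
    noisy_latents,
    timesteps,
    encoder_hidden_states=partial_image_emb,
).sample
```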

However, I am struggling to understand exactly how this works in both training and inference. During training, does anything change from how you would train a text-to-image conditional model? My immediate reaction is that the noise we are predicting during training would need to be conditioned on the final timestep "T" being a sample image rather than pure Gaussian noise, but I am unsure how you would generate that training data.
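For reference, my current training loop is the standard noise-prediction objective, so concretely I'm asking whether anything in a step like this would change (simplified; the scheduler choice and batch handling are placeholders):

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

def training_step(unet, clean_latents, partial_image_emb):
    # Standard epsilon-prediction objective: sample a random timestep, noise the
    # clean latents, and train the UNet to recover the added noise.
    noise = torch.randn_like(clean_latents)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps, (clean_latents.shape[0],),
        device=clean_latents.device,
    )
    noisy_latents = scheduler.add_noise(clean_latents, noise, timesteps)
    noise_pred = unet(noisy_latents, timesteps,
                      encoder_hidden_states=partial_image_emb).sample
    return F.mse_loss(noise_pred, noise)
```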

Then, during inference, it seems like the img2img pipeline from the link above just runs the normal denoising process but uses the provided image as the starting latent rather than the randomly sampled latent vector you would expect for text-to-image.
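To make my reading concrete, this is roughly how I understand that part of the pipeline (heavily simplified; the strength handling and scaling are my approximation, not the exact pipeline code):

```python
import torch

def img2img_start_latents(vae, scheduler, image, strength=0.8, num_inference_steps=50):
    # Encode the input image to latents, then add noise corresponding to an
    # intermediate timestep chosen by `strength` (strength close to 1.0 is
    # nearly pure noise, i.e. close to text2img). Denoising then runs from
    # that intermediate point instead of from a random latent.
    scheduler.set_timesteps(num_inference_steps)
    init_latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

    t_start = int(num_inference_steps * (1 - strength))
    timesteps = scheduler.timesteps[t_start:]

    noise = torch.randn_like(init_latents)
    latents = scheduler.add_noise(init_latents, noise, timesteps[:1])
    return latents, timesteps
```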

Is it really as simple as training a standard text-to-image diffusion model and then dropping in an encoded image as the starting point during inference?

I appreciate the help.