Img2img: how are training and inference different from text2img?

I would like to apply the ideas behind the StableDiffusionImg2ImgPipeline in a slightly different domain, but I don't fully understand the approach to image-conditioned diffusion.

Currently I am training a diffusion model to generate image-like results from partial image inputs (think "complete this image from partial information"). Right now the approach is to use a UNet2DConditionModel() with the partial image encoded and passed as encoder_hidden_states. I get decent results, but I believe this would perform better if the partial images were used as the latent starting point, as shown in the img2img code here and as described in the Stable Diffusion paper here, where the conditioning can be concatenated to the latent vector (or replace it?) as the starting noise during inference.
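For concreteness, here is a simplified sketch of my current setup; the encoder output, sequence length, and dimensions are placeholders for what I actually use:

```python
import torch
from diffusers import UNet2DConditionModel

# Simplified version of my current approach: the partial image is encoded into
# a sequence of embeddings and fed to the UNet through cross-attention.
unet = UNet2DConditionModel(
    sample_size=64,
    in_channels=4,
    out_channels=4,
    cross_attention_dim=768,
)

noisy_latents = torch.randn(1, 4, 64, 64)      # x_t
timesteps = torch.tensor([500])
partial_image_emb = torch.randn(1, 77, 768)    # placeholder for my encoded partial image

noise_pred = unet(
    noisy_latents,
    timesteps,
    encoder_hidden_states=partial_image_emb,
).sample
```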

However, I am struggling to understand exactly how this works in both training and inference. During training, does anything change from how you would train a text-to-image conditional model? My immediate reaction is that the noise we are predicting during training would need to be conditioned on the final timestep "T" being a sample image rather than pure Gaussian noise, but I am unsure how you would generate that training data.
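For reference, my current training loop is the standard noise-prediction objective, so concretely I'm asking whether anything in a step like this would change (simplified; the scheduler choice and batch handling are placeholders):

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

def training_step(unet, clean_latents, partial_image_emb):
    # Standard epsilon-prediction objective: sample a random timestep, noise the
    # clean latents, and train the UNet to recover the added noise.
    noise = torch.randn_like(clean_latents)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps, (clean_latents.shape[0],),
        device=clean_latents.device,
    )
    noisy_latents = scheduler.add_noise(clean_latents, noise, timesteps)
    noise_pred = unet(noisy_latents, timesteps,
                      encoder_hidden_states=partial_image_emb).sample
    return F.mse_loss(noise_pred, noise)
```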

Then, during inference, it seems like the img2img pipeline from the link above just runs the normal denoising process but uses the provided image as the starting latent rather than the randomly sampled latent vector you would expect for text-to-image.
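To make my reading concrete, this is roughly how I understand that part of the pipeline (heavily simplified; the strength handling and scaling are my approximation, not the exact pipeline code):

```python
import torch

def img2img_start_latents(vae, scheduler, image, strength=0.8, num_inference_steps=50):
    # Encode the input image to latents, then add noise corresponding to an
    # intermediate timestep chosen by `strength` (strength close to 1.0 is
    # nearly pure noise, i.e. close to text2img). Denoising then runs from
    # that intermediate point instead of from a random latent.
    scheduler.set_timesteps(num_inference_steps)
    init_latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

    t_start = int(num_inference_steps * (1 - strength))
    timesteps = scheduler.timesteps[t_start:]

    noise = torch.randn_like(init_latents)
    latents = scheduler.add_noise(init_latents, noise, timesteps[:1])
    return latents, timesteps
```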

Is it really as simple as training a standard text-to-image diffusion model and then dropping in an encoded image as the starting point during inference?

I appreciate the help.