Hello everyone!
I’m currently working on a personal project that uses a diffusion model for image reconstruction. Below is an idea of mine that failed, and I would like to ask about possible reasons why it didn’t work.
I used the textual inversion training script in diffusers. When training a diffusion model, the UNet is used to predict the noise that was added to an image, so naturally I used the following code (adapted from DDIM) to predict a pseudo x_0 from the noisy latent and the noise predicted by the UNet.
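For reference, this comes straight from the forward process: since $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, solving for $x_0$ gives

$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\hat\epsilon}{\sqrt{\bar\alpha_t}},$$

where $\hat\epsilon$ is the UNet’s noise prediction and $\bar\alpha_t$ is the scheduler’s `alphas_cumprod[t]`. The function below applies this in latent space and then decodes the result with the VAE: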
```python
def predict_x_0(scheduler, decoder, pred_noise, latent, timestep):
    # move the timestep to the CPU so it can index the scheduler's alphas
    timestep = timestep.cpu().item()
    alpha_prod_t = scheduler.alphas_cumprod[timestep]
    # DDIM-style estimate of the clean latent:
    # z_0 = (z_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)
    pred_z_0 = (latent - (1 - alpha_prod_t) ** 0.5 * pred_noise) / alpha_prod_t ** 0.5
    # decode the predicted latent back to pixel space
    # (note: if the latents were scaled by vae.config.scaling_factor when
    # encoded, as in the textual inversion script, they should be un-scaled
    # before decoding)
    pred_x_0 = decoder.decode(pred_z_0).sample
    return pred_x_0
```
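As a quick sanity check of the formula (a minimal, self-contained sketch, assuming a `DDPMScheduler` with default settings), noising a known latent and then inverting it with the same expression should recover the original exactly:

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
z_0 = torch.randn(1, 4, 64, 64)           # a clean latent
noise = torch.randn_like(z_0)
t = torch.tensor(500)
z_t = scheduler.add_noise(z_0, noise, t)  # forward process
# invert with the same expression used in predict_x_0
alpha_prod_t = scheduler.alphas_cumprod[t]
z_0_hat = (z_t - (1 - alpha_prod_t) ** 0.5 * noise) / alpha_prod_t ** 0.5
assert torch.allclose(z_0_hat, z_0, atol=1e-4)
```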
Then I compute the reconstruction loss and add it to the loss that is backpropagated:
```python
# predict x_0 from the current latents and the UNet's noise prediction
pred_x_0 = predict_x_0(noise_scheduler, vae, model_pred, latents, timesteps)
# reconstruction loss between the decoded prediction and the input image
recon_loss = F.mse_loss(pred_x_0, batch["pixel_values"].to(dtype=weight_dtype), reduction="mean")
# standard textual inversion loss on the noise prediction
inversion_loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
ttl_loss = inversion_loss + recon_loss
accelerator.backward(ttl_loss)
```
However, the model generated less similar images compared to the case without this reconstruction loss. (In the validation process, I use the prompt “A <kodak_img> picture itself”, where <kodak_img> is the placeholder token used during training.)
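In case it helps, here is a minimal sketch of how the validation image is generated (the checkpoint name and the embedding path are placeholders for my actual setup):

```python
import torch
from diffusers import StableDiffusionPipeline

# "runwayml/stable-diffusion-v1-5" and the embeddings path are placeholders
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# load the learned embedding for the placeholder token
pipe.load_textual_inversion("output_dir/learned_embeds.safetensors", token="<kodak_img>")
image = pipe("A <kodak_img> picture itself", num_inference_steps=50).images[0]
image.save("validation.png")
```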
I was wondering why my idea didn’t work and would appreciate any advice on this matter!
Thanks in advance!