Loss doesn't converge for latent diffusion model

I have been trying to train a latent diffusion model, but the loss doesn't seem to converge. I took inspiration from this example and modified it to take latents as input instead of images.

Link to my kaggle notebook: Latent diffusion for Monets[Multi-gpu, high res] | Kaggle
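
Roughly what the change looks like (a simplified sketch, not the exact notebook code; the VAE checkpoint, latent size, and scaling factor here are placeholders following the usual Stable Diffusion convention):

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DModel, DDPMScheduler

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()
unet = UNet2DModel(sample_size=32, in_channels=4, out_channels=4)
scheduler = DDPMScheduler(num_train_timesteps=1000)

@torch.no_grad()
def to_latents(images):
    # images: (B, 3, H, W) scaled to [-1, 1]; 0.18215 is the SD latent scaling
    return vae.encode(images).latent_dist.sample() * 0.18215

def loss_fn(images):
    latents = to_latents(images)
    noise = torch.randn_like(latents)
    t = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t).sample
    return F.mse_loss(pred, noise)  # predict the added noise
```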
Things I've tried so far:

  • Model architecture: since the latents fed into the UNet are much smaller than the images the UNet was originally built for, the middle layers end up with very small feature maps, which might hamper the model's ability to learn. So I removed some layers (rough sketch of what I mean below the list), but it didn't help; in fact the average loss increased.
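
By "removed some layers" I mean something along these lines (a hypothetical reduced-depth config, not my exact one): one fewer resolution level so that 32x32 latents bottom out at an 8x8 feature map instead of 4x4.

```python
from diffusers import UNet2DModel

# Hypothetical shallower UNet for 32x32 latents: 3 resolution levels
# instead of the default 4, so the bottleneck stays at 8x8.
unet_small = UNet2DModel(
    sample_size=32,
    in_channels=4,
    out_channels=4,
    block_out_channels=(128, 256, 512),
    down_block_types=("DownBlock2D", "AttnDownBlock2D", "AttnDownBlock2D"),
    up_block_types=("AttnUpBlock2D", "AttnUpBlock2D", "UpBlock2D"),
)
```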