How to create a UNet2DConditionModel

Hi everyone,

I’m a student trying to learn how to build diffusion models, and I’ve recently gotten stuck on a problem.

I’m following this guide on how to train a diffusion model from scratch using a UNet2DModel. Now I’m trying to swap the UNet2DModel for a UNet2DConditionModel, so that I can condition the model on a text description of each image in my dataset. I found a couple of resources online on how to create a UNet2DConditionModel, but they all use pre-trained models. I want to train my UNet2DConditionModel from scratch, as in the guide I mentioned, using my own dataset of images with corresponding descriptions.

Does anyone know how to create and train a UNet2DConditionModel from scratch, so that I can feed the model each image together with its description?

I appreciate any help you can provide.