Replace text encoder with a different encoder in Stable Diffusion


I am trying to train Stable Diffusion for image generation. Instead of the default text encoder, I have a function that takes the input prompt and returns its embedding, which is supposed to be used as the model's condition (in place of the prompt encoding).

I was able to train the model by modifying the training script provided here: diffusers/examples/text_to_image/ at main · huggingface/diffusers · GitHub.

When I use StableDiffusionPipeline.from_pretrained, the generated images look like they have learned the pattern of the training samples. However, as I understand it, the loaded checkpoint does not take into account the condition embedding that the model needs; it just loads the default text encoder that was saved with the pipeline. The reason, as far as I can tell, is that StableDiffusionPipeline.from_pretrained() has no input argument through which I can pass my condition encoder.

Could someone please let me know whether it is possible to incorporate my condition in this setting for inference (or, if not, how I should modify it)? Any guidance on making this work would be highly appreciated.
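For context, one route that may apply here: in recent diffusers versions, calling a StableDiffusionPipeline accepts precomputed embeddings through the `prompt_embeds` and `negative_prompt_embeds` arguments, which bypasses the saved text encoder at inference. Below is a minimal sketch of that idea, not a confirmed solution. The helper name `generate_with_custom_condition` and the `encode_fn` argument are hypothetical stand-ins for the custom embedding function described above, and the sketch assumes that function returns a tensor of shape (batch, seq_len, cross_attention_dim) matching what the UNet saw during training.

```python
from typing import Any, Callable, List


def generate_with_custom_condition(
    pipe: Callable[..., Any],
    encode_fn: Callable[[List[str]], Any],
    prompts: List[str],
    **pipe_kwargs: Any,
) -> Any:
    """Call a pipeline with embeddings from a custom encoder.

    `encode_fn` is a hypothetical placeholder for the custom function
    from the question; it is assumed to map a list of prompts to a
    tensor shaped (batch, seq_len, cross_attention_dim).
    """
    # Conditional embeddings come from the custom encoder, so the
    # pipeline's own text encoder is not used for the prompt.
    cond = encode_fn(prompts)
    # Unconditional embeddings for classifier-free guidance. Assumption:
    # the custom encoder treats an empty prompt as it did in training.
    uncond = encode_fn([""] * len(prompts))
    # No `prompt` string is passed, only the precomputed embeddings.
    return pipe(
        prompt_embeds=cond,
        negative_prompt_embeds=uncond,
        **pipe_kwargs,
    )


# Usage sketch (not executed here; the path and `my_encoder` are placeholders):
#
#   from diffusers import StableDiffusionPipeline
#   pipe = StableDiffusionPipeline.from_pretrained("path/to/output_dir")
#   result = generate_with_custom_condition(pipe, my_encoder, ["a photo"])
#   image = result.images[0]
```

The embeddings would need the same dtype and device as the pipeline's UNet. If training never used empty prompts, the unconditional branch may need a different choice than `encode_fn([""])`.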

Please let me know if any clarification is needed.