Add additional trainable layers to StableDiffusion for fine-tuning

aashananth · October 8, 2023, 10:40pm

Hi, I am trying to modify the default Stable Diffusion model architecture to support different types of text input in place of a single text caption. For this, I plan to encode each text input using CLIP and combine the resulting embeddings using cross-attention.
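To make the idea concrete, here is a minimal sketch of such a fusion layer in plain PyTorch. It is an illustration only, not code from diffusers: the `TextFusion` class and its argument names are hypothetical, the random tensors stand in for CLIP text-encoder outputs, and the sizes (77 tokens, hidden size 768) assume the CLIP text encoder used by Stable Diffusion 1.x.

```python
import torch
import torch.nn as nn


class TextFusion(nn.Module):
    """Hypothetical layer: fuses two CLIP-style token sequences via cross-attention."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, primary: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
        # primary supplies the queries; extra supplies the keys and values.
        attended, _ = self.attn(query=primary, key=extra, value=extra)
        # Residual connection followed by LayerNorm (a common transformer pattern).
        return self.norm(primary + attended)


fusion = TextFusion()
caption_emb = torch.randn(2, 77, 768)  # stand-in for CLIP hidden states of input A
extra_emb = torch.randn(2, 77, 768)    # stand-in for CLIP hidden states of input B
fused = fusion(caption_emb, extra_emb)
print(fused.shape)  # same shape as caption_emb
```

Because the output keeps the shape of a single caption embedding, it could in principle be passed to the UNet as `encoder_hidden_states` in place of the usual text embedding, without changing the UNet's own cross-attention layers.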

Where in the diffusers codebase would I need to add this cross-attention layer or modify existing layers, and how would I train it?
Is it possible to train this layer jointly while fine-tuning the base Stable Diffusion model (full fine-tuning or with LoRA)?
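As a rough sketch of what joint training could look like, the new layer and the base model's trainable parameters can be handed to one optimizer as separate parameter groups. Everything here is a placeholder chosen so the snippet runs without downloading weights: `base` is a small linear layer standing in for the UNet (or its LoRA parameters), the MSE loss stands in for the diffusion noise-prediction loss, and the learning rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stand-ins (both hypothetical) so the sketch is self-contained:
fusion = nn.MultiheadAttention(768, 8, batch_first=True)  # the new trainable layer
base = nn.Linear(768, 768)  # placeholder for the UNet or its LoRA parameters

# One optimizer, two parameter groups with separate learning rates.
optimizer = torch.optim.AdamW([
    {"params": fusion.parameters(), "lr": 1e-4},  # assumed value
    {"params": base.parameters(), "lr": 1e-5},    # assumed value
])

text_a = torch.randn(2, 77, 768)  # stand-in for the first CLIP embedding
text_b = torch.randn(2, 77, 768)  # stand-in for the second CLIP embedding
target = torch.randn(2, 77, 768)  # stand-in for the training target

fused, _ = fusion(text_a, text_b, text_b)           # query, key, value
loss = nn.functional.mse_loss(base(fused), target)  # placeholder loss
loss.backward()   # gradients reach both the new layer and the base model
optimizer.step()
```

The same pattern applies if the base model is frozen and only LoRA parameters are trained: the second group would then contain just those parameters.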
