Adding trainable layers to Stable Diffusion for fine-tuning

Hi, I am trying to modify the default Stable Diffusion architecture so that it accepts several different text inputs instead of a single text caption. To do this, I plan to encode each text input with CLIP and combine the resulting embeddings using cross-attention.
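To make the idea concrete, here is a rough sketch of the kind of fusion layer I have in mind. The class name `TextFusion`, the residual connection plus LayerNorm, and the choice to let the main caption act as the query are all my own assumptions, not anything taken from diffusers. Random tensors stand in for the CLIP text encoder outputs (77 tokens of 768 dimensions, as in SD 1.x).

```python
import torch
import torch.nn as nn


class TextFusion(nn.Module):
    """Hypothetical fusion layer: the main caption embedding attends to the
    embeddings of the additional text inputs via cross-attention."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, primary: torch.Tensor, extras: list[torch.Tensor]) -> torch.Tensor:
        # primary: (batch, seq, dim); each tensor in extras: (batch, seq_i, dim)
        context = torch.cat(extras, dim=1)  # concatenate along the token axis
        attended, _ = self.attn(query=primary, key=context, value=context)
        # Residual connection keeps the original caption signal intact.
        return self.norm(primary + attended)


# Random stand-ins for CLIP text embeddings (assumption: SD 1.x shapes).
fusion = TextFusion()
caption = torch.randn(2, 77, 768)
extras = [torch.randn(2, 77, 768), torch.randn(2, 77, 768)]
fused = fusion(caption, extras)
print(fused.shape)
```

My assumption is that the fused tensor would then be passed to the UNet as `encoder_hidden_states` in place of the usual single-caption embedding, since it keeps the same shape.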

  1. Where in the diffusers codebase would I need to add this cross-attention layer (or modify existing layers), and how would I train it?
  2. Is it possible to train this layer jointly while fine-tuning the base Stable Diffusion model (full fine-tuning or LoRA)?
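For question 2, the setup I am imagining is roughly a single optimizer with two parameter groups, one for the base model (or its LoRA weights) and one for the new layer. This is only a sketch under assumptions: `unet` below is a small `nn.Linear` stand-in rather than the real UNet, the loss is a placeholder rather than the diffusion objective, and the learning rates are illustrative values I picked.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: in practice `unet` would be the Stable Diffusion UNet
# (or only its LoRA parameters) and `fusion` the new cross-attention layer.
unet = nn.Linear(768, 768)
fusion = nn.MultiheadAttention(768, 8, batch_first=True)

# One optimizer, two parameter groups with separate (illustrative) learning rates.
optimizer = torch.optim.AdamW(
    [
        {"params": unet.parameters(), "lr": 1e-5},
        {"params": fusion.parameters(), "lr": 1e-4},
    ]
)

caption = torch.randn(2, 77, 768)  # stand-in for the main caption embedding
extra = torch.randn(2, 77, 768)    # stand-in for an additional text embedding

optimizer.zero_grad()
fused, _ = fusion(caption, extra, extra)  # query, key, value
loss = unet(fused).pow(2).mean()          # placeholder loss, not the real objective
loss.backward()                           # gradients flow into both modules
optimizer.step()
```

Is this the right general pattern, or does the diffusers training code expect something different?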