In earlier T2I models such as Stable Diffusion 1.5, which used a UNet backbone (`self.unet`), we could register attention stores and custom attention processors to steer denoising by editing the self- and cross-attention maps. For more details, refer to this resource.
Currently, I'm looking for help with hooking into the attention mechanism in the following call:
```python
noise_pred = self.transformer(
    latent_model_input, timestep=timesteps, class_labels=class_labels_input
).sample
```
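For context, here is the kind of hook I have in mind: a minimal sketch modeled on diffusers' default `AttnProcessor` that records every attention map into a shared list; editing the maps would happen at the same point. The class name `AttnMapStoreProcessor`, the `store` list, and the `pipe` variable are my own placeholders, and the sketch assumes your diffusers version exposes `set_attn_processor` on the transformer. It also skips the group-norm/cross-norm branches that DiT-style self-attention blocks don't use.

```python
import torch
from diffusers.models.attention_processor import Attention


class AttnMapStoreProcessor:
    """Hypothetical processor, modeled on diffusers' default AttnProcessor.

    Records each attention map into a shared list; to *edit* attention
    instead, modify `attention_probs` before the bmm below.
    """

    def __init__(self, store):
        self.store = store  # shared list collecting attention maps

    def __call__(self, attn: Attention, hidden_states,
                 encoder_hidden_states=None, attention_mask=None, **kwargs):
        batch_size, seq_len, _ = hidden_states.shape
        attention_mask = attn.prepare_attention_mask(
            attention_mask, seq_len, batch_size
        )

        query = attn.to_q(hidden_states)
        # Self-attention when no encoder states are passed (the DiT case).
        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states
        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)

        query = attn.head_to_batch_dim(query)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)

        # Hook point: store the maps here, or overwrite them to edit.
        attention_probs = attn.get_attention_scores(query, key, attention_mask)
        self.store.append(attention_probs.detach().cpu())

        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)

        hidden_states = attn.to_out[0](hidden_states)  # output projection
        hidden_states = attn.to_out[1](hidden_states)  # dropout
        return hidden_states


# pipe = DiTPipeline.from_pretrained(...)  # assumed: pipeline already loaded
store = []
# A single processor instance is applied to every attention layer;
# set_attn_processor also accepts a dict keyed by layer name if you
# only want to hook specific blocks.
pipe.transformer.set_attn_processor(AttnMapStoreProcessor(store))
```

If `set_attn_processor` isn't available on the transformer in your version, iterating `pipe.transformer.named_modules()` and calling `set_processor` on each `Attention` instance should achieve the same thing.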
Does anyone have insights on how to achieve this? I believe models built on an MMDiT architecture could benefit significantly from attention editing, potentially enabling finer control over generation. Any guidance would be greatly appreciated!