Seeking Help with Attention Hooking in DiT-based T2I Models

In earlier T2I models such as Stable Diffusion 1.5, which are UNet-based (the pipeline's self.unet), we could use attention stores and attention processors to control image denoising by editing the self- or cross-attention maps. For more details, refer to this resource.
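For reference, here is a minimal sketch of that UNet-era pattern, assuming a loaded StableDiffusionPipeline in pipe. StoreAttnProcessor is a name of my own; the attn helpers (get_attention_scores, head_to_batch_dim, batch_to_head_dim) are the ones exposed by diffusers' Attention module, and the processor mirrors the non-fused AttnProcessor so the attention probabilities can be stored (or edited) before they are applied to the values:

import torch
from diffusers.models.attention_processor import Attention

class StoreAttnProcessor:
    """Drop-in attention processor that also records the attention maps."""
    def __init__(self, store: list):
        self.store = store

    def __call__(self, attn: Attention, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        is_cross = encoder_hidden_states is not None
        context = encoder_hidden_states if is_cross else hidden_states

        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))

        # Shape: (batch * heads, query_tokens, key_tokens) — store or edit the maps here.
        attention_probs = attn.get_attention_scores(query, key, attention_mask)
        self.store.append(("cross" if is_cross else "self", attention_probs.detach().cpu()))

        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)

        hidden_states = attn.to_out[0](hidden_states)   # output projection
        hidden_states = attn.to_out[1](hidden_states)   # dropout
        return hidden_states

# Attach the same processor instance to every attention layer of the SD1.5 UNet.
maps = []
pipe.unet.set_attn_processor(StoreAttnProcessor(maps))

Keeping every map on CPU for all layers and steps can use a lot of memory, so in practice one would usually filter by layer name or resolution before storing.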

Currently, I’m looking for assistance on how to hook into the attention mechanism in the following context:

# denoising step (as in diffusers' DiTPipeline.__call__, where self is the pipeline)
noise_pred = self.transformer(
    latent_model_input, timestep=timesteps, class_labels=class_labels_input
).sample

Does anyone have insights on how to achieve this? I believe that a model based on an MMDiT architecture (e.g., Stable Diffusion 3) could benefit significantly from attention editing, potentially giving smoother, more controllable results. Any guidance would be greatly appreciated!
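For the DiT case, this is the direction I'm experimenting with (an untested sketch, reusing StoreAttnProcessor and maps from the snippet above). Whether the transformer itself exposes set_attn_processor depends on the diffusers version, so this walks the submodules and sets the processor on every Attention layer instead. Note that DiT conditions on the class label through adaLN-Zero rather than cross-attention, so the captured maps are self-attention maps over the latent patch tokens:

import torch
from diffusers import DiTPipeline
from diffusers.models.attention_processor import Attention

pipe = DiTPipeline.from_pretrained(
    "facebook/DiT-XL-2-256", torch_dtype=torch.float16
).to("cuda")

maps = []
# Generic fallback: replace the processor on every Attention submodule of the transformer.
for name, module in pipe.transformer.named_modules():
    if isinstance(module, Attention):
        module.set_processor(StoreAttnProcessor(maps))

# After the call, maps holds one entry per attention layer per denoising step.
images = pipe(class_labels=[207], num_inference_steps=25).images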


Hi, I'm also looking for the cross-attention maps in a DiT model. Have you found a solution yet?
