Transformers Attention Viz - Visualizing Cross-Modal Attention in CLIP/BLIP Models

Hey everyone! :waving_hand:

I’ve been working on a tool to visualize attention patterns in multi-modal transformers, following guidance from the Transformers team to develop it as a standalone package first.

:magnifying_glass_tilted_left: What it does

The tool extracts and visualizes attention weights from models like CLIP and BLIP to show how they connect text tokens with image regions. This helps you understand:

  • Which words the model focuses on when processing text
  • How attention is distributed across image patches
  • Statistical patterns in attention (entropy, concentration)

:hammer_and_wrench: Technical Implementation

Attention Extraction:

  • Hooks into transformer attention layers during the forward pass (see the sketch after this list)
  • Handles different model architectures (CLIP vs. BLIP) with an adapter pattern
  • Extracts attention weights without modifying model behavior
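
To make the hook idea concrete, here is a minimal sketch of the pattern rather than the package's actual internals. It assumes the Hugging Face CLIP implementation, where the attention submodules are named ...self_attn and return an (attn_output, attn_weights) tuple, and it forces eager attention so the weights are actually materialized; the real adapter layer is what smooths over per-architecture differences like these.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Eager attention so the attention modules return their weights
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", attn_implementation="eager")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captured = {}  # module name -> attention weights [batch, heads, seq, seq]

def make_hook(name):
    def hook(module, inputs, output):
        # CLIP attention modules return (attn_output, attn_weights)
        if isinstance(output, tuple) and len(output) > 1 and output[1] is not None:
            captured[name] = output[1].detach()
    return hook

handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in model.named_modules()
    if name.endswith("self_attn")
]

image = Image.new("RGB", (224, 224))  # placeholder image for the demo
inputs = processor(text=["a fluffy orange cat"], images=image, return_tensors="pt")
with torch.no_grad():
    model(**inputs, output_attentions=True)

# Removing the hooks afterwards leaves the model exactly as it was
for handle in handles:
    handle.remove()

print(sorted(captured))  # text_model.encoder.layers.0.self_attn, vision_model..., etc.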

Visualization Approach:

  • Heatmap shows the full attention matrix between text and image tokens
  • Statistical analysis includes entropy and a Gini coefficient for attention concentration (see the sketch after this list)
  • Built on matplotlib/seaborn for static visualizations and Gradio for an interactive dashboard
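
As a reference for what those statistics mean, here is a rough sketch of how entropy and the Gini coefficient can be computed over attention distributions; it is illustrative rather than the exact code the package uses.

import torch

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    # Shannon entropy of each attention distribution (last dim sums to 1);
    # higher values mean attention is spread across many tokens/patches
    return -(attn * torch.log(attn.clamp_min(1e-12))).sum(dim=-1)

def attention_gini(attn: torch.Tensor) -> torch.Tensor:
    # Gini coefficient per distribution: ~0 = uniform, near 1 = all mass on one token
    sorted_attn, _ = torch.sort(attn, dim=-1)
    n = sorted_attn.shape[-1]
    ranks = torch.arange(1, n + 1, dtype=sorted_attn.dtype)
    return (2 * (ranks * sorted_attn).sum(dim=-1)) / (n * sorted_attn.sum(dim=-1)) - (n + 1) / n

# Example on attention of shape [batch, heads, queries, keys]
attn = torch.softmax(torch.randn(1, 8, 10, 10), dim=-1)
print(attention_entropy(attn).shape, attention_gini(attn).shape)  # both torch.Size([1, 8, 10])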

:bar_chart: Example Output

Here’s what it shows for CLIP processing “a fluffy orange cat”:
[Insert your heatmap image]

The model correctly focuses 73% of attention on the “cat” token, with secondary attention on descriptive words.
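
If you want to reproduce this kind of plot outside the package, a seaborn heatmap over a text-token-by-image-patch attention matrix is essentially all it takes. The sketch below uses random weights as a stand-in for real attention, so only the plotting part is meaningful.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

tokens = ["a", "fluffy", "orange", "cat"]  # text tokens on the y-axis
num_patches = 49                           # e.g. a 7x7 ViT patch grid
attn = np.random.dirichlet(np.ones(num_patches), size=len(tokens))  # placeholder weights, rows sum to 1

fig, ax = plt.subplots(figsize=(10, 3))
sns.heatmap(attn, ax=ax, cmap="viridis", yticklabels=tokens, xticklabels=False,
            cbar_kws={"label": "attention weight"})
ax.set_xlabel("image patch")
ax.set_ylabel("text token")
plt.tight_layout()
plt.show()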

:laptop: Code Example

from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from attention_viz import AttentionVisualizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
visualizer = AttentionVisualizer(model, processor)

# Inputs to analyze (swap in your own image path)
image = Image.open("cat.jpg")
text = "a fluffy orange cat"

# Get attention statistics
stats = visualizer.get_attention_stats(image, text)
print(f"Top attended tokens: {stats['top_tokens']}")

:construction: Current Limitations

  • Only the heatmap visualization is fully working in v0.1.0
  • The flow and evolution visualizations have dimension-mismatch bugs I'm still debugging
  • Currently supports CLIP and BLIP; more models are in progress

:link: Links

  • GitHub: https://github.com/YOUR_USERNAME/transformers-attention-viz
  • PyPI: pip install transformers-attention-viz
  • Colab Demo: [if you have one]

:thinking: Questions for the Community

  • What other models would you like to see supported?
  • Are there specific attention patterns you're interested in analyzing?
  • Any suggestions for the API design before potential Transformers integration?

Notes

Built with the goal of eventually integrating into Transformers core if there’s interest. Would love your feedback on the approach and implementation!
Technical details: built on PyTorch forward hooks, tested with transformers 4.53.2, and ships with a comprehensive test suite.
