Transformers Attention Viz - Visualizing Cross-Modal Attention in CLIP/BLIP Models

Hey everyone! :waving_hand:

I’ve been working on a tool to visualize attention patterns in multi-modal transformers, following guidance from the Transformers team to develop it as a standalone package first.

:magnifying_glass_tilted_left: What it does

The tool extracts and visualizes attention weights from models like CLIP and BLIP to show how they connect text tokens with image regions. This helps you understand:

  • Which words the model focuses on when processing text
  • How attention is distributed across image patches
  • Statistical patterns in attention (entropy, concentration)

:hammer_and_wrench: Technical Implementation

Attention Extraction:

  • Hooks into transformer attention layers during the forward pass (see the sketch after this list)
  • Handles different model architectures (CLIP vs. BLIP) with an adapter pattern
  • Extracts attention weights without modifying model behavior
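
To make the hook idea concrete, here is a minimal sketch of the pattern rather than the package's actual internals. It assumes the Hugging Face CLIP implementation, where the attention submodules are named ...self_attn and return an (attn_output, attn_weights) tuple, and it forces eager attention so the weights are actually materialized; the real adapter layer is what smooths over per-architecture differences like these.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Eager attention so the attention modules return their weights
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", attn_implementation="eager")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captured = {}  # module name -> attention weights [batch, heads, seq, seq]

def make_hook(name):
    def hook(module, inputs, output):
        # CLIP attention modules return (attn_output, attn_weights)
        if isinstance(output, tuple) and len(output) > 1 and output[1] is not None:
            captured[name] = output[1].detach()
    return hook

handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in model.named_modules()
    if name.endswith("self_attn")
]

image = Image.new("RGB", (224, 224))  # placeholder image for the demo
inputs = processor(text=["a fluffy orange cat"], images=image, return_tensors="pt")
with torch.no_grad():
    model(**inputs, output_attentions=True)

# Removing the hooks afterwards leaves the model exactly as it was
for handle in handles:
    handle.remove()

print(sorted(captured))  # text_model.encoder.layers.0.self_attn, vision_model..., etc.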

Visualization Approach:

  • Heatmap shows the full attention matrix between text and image tokens
  • Statistical analysis includes entropy and a Gini coefficient for attention concentration (see the sketch after this list)
  • Built on matplotlib/seaborn for static visualizations and Gradio for an interactive dashboard
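
As a reference for what those statistics mean, here is a rough sketch of how entropy and the Gini coefficient can be computed over attention distributions; it is illustrative rather than the exact code the package uses.

import torch

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    # Shannon entropy of each attention distribution (last dim sums to 1);
    # higher values mean attention is spread across many tokens/patches
    return -(attn * torch.log(attn.clamp_min(1e-12))).sum(dim=-1)

def attention_gini(attn: torch.Tensor) -> torch.Tensor:
    # Gini coefficient per distribution: ~0 = uniform, near 1 = all mass on one token
    sorted_attn, _ = torch.sort(attn, dim=-1)
    n = sorted_attn.shape[-1]
    ranks = torch.arange(1, n + 1, dtype=sorted_attn.dtype)
    return (2 * (ranks * sorted_attn).sum(dim=-1)) / (n * sorted_attn.sum(dim=-1)) - (n + 1) / n

# Example on attention of shape [batch, heads, queries, keys]
attn = torch.softmax(torch.randn(1, 8, 10, 10), dim=-1)
print(attention_entropy(attn).shape, attention_gini(attn).shape)  # both torch.Size([1, 8, 10])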

:bar_chart: Example Output

Here’s what it shows for CLIP processing “a fluffy orange cat”:
[Insert your heatmap image]

The model correctly focuses 73% of attention on the “cat” token, with secondary attention on descriptive words.
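
If you want to reproduce this kind of plot outside the package, a seaborn heatmap over a text-token-by-image-patch attention matrix is essentially all it takes. The sketch below uses random weights as a stand-in for real attention, so only the plotting part is meaningful.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

tokens = ["a", "fluffy", "orange", "cat"]  # text tokens on the y-axis
num_patches = 49                           # e.g. a 7x7 ViT patch grid
attn = np.random.dirichlet(np.ones(num_patches), size=len(tokens))  # placeholder weights, rows sum to 1

fig, ax = plt.subplots(figsize=(10, 3))
sns.heatmap(attn, ax=ax, cmap="viridis", yticklabels=tokens, xticklabels=False,
            cbar_kws={"label": "attention weight"})
ax.set_xlabel("image patch")
ax.set_ylabel("text token")
plt.tight_layout()
plt.show()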

:laptop: Code Example

from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from attention_viz import AttentionVisualizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
visualizer = AttentionVisualizer(model, processor)

# Inputs to analyze (swap in your own image path)
image = Image.open("cat.jpg")
text = "a fluffy orange cat"

# Get attention statistics
stats = visualizer.get_attention_stats(image, text)
print(f"Top attended tokens: {stats['top_tokens']}")

:construction: Current Limitations

  • Only the heatmap visualization is fully working in v0.1.0
  • The flow and evolution visualizations have dimension-mismatch bugs I'm still debugging
  • Currently supports CLIP and BLIP; more models are in progress

:link: Links

  • GitHub: https://github.com/YOUR_USERNAME/transformers-attention-viz
  • PyPI: pip install transformers-attention-viz
  • Colab Demo: [if you have one]

:thinking: Questions for the Community

  • What other models would you like to see supported?
  • Are there specific attention patterns you're interested in analyzing?
  • Any suggestions for the API design before potential Transformers integration?

Notes

Built with the goal of eventually integrating into Transformers core if there’s interest. Would love your feedback on the approach and implementation!
Technical details: built on PyTorch forward hooks, tested with transformers 4.53.2, and ships with a comprehensive test suite.
