[Analysis] How alignment affects attention specialization across model families

Hi all,

I’m sharing results from a systematic empirical analysis of how
preference optimization (RLHF, DPO) affects attention head
specialization across different LLM families.

This is not a single-model case study: the analysis covers 25+ models
from 8 vendor families, uses a standardized protocol (bfloat16,
3-seed cross-validation), and includes explicit falsification checks.

Empirical result at a glance:
At matched scale, Grouped Query Attention (GQA) exhibits orders-of-magnitude
higher sensitivity to random attention noise than Multi-Head Attention (MHA).

What we consistently observe:

  • Sliding Window Attention (e.g. Mistral, Gemma-2) preserves or increases
    attention specialization under alignment, while comparable non-SWA
    models show a large collapse in specialization index (SI).

  • Synthetic-data training (Phi family) yields near scale-invariant
    specialization (SI ≈ 0.33) across a ~10× parameter range.

  • Grouped Query Attention shows ~5,800× higher sensitivity to random
    attention noise than Multi-Head Attention at matched scale
    (ratio-of-means across three seeds), yet shows greater resilience
    under structured recursive alignment pressure.
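
To make the "ratio-of-means across three seeds" aggregation concrete, here is a minimal sketch. The per-seed sensitivity numbers below are placeholders I made up for illustration (chosen so the ratio lands near the reported ~5,800×); they are not the paper's measurements. The point is the aggregation order: average each architecture's sensitivity over seeds first, then take the ratio, rather than averaging per-seed ratios.

```python
from statistics import mean

# Hypothetical per-seed noise-sensitivity values (change in SI per unit
# of injected attention noise). Placeholder numbers for illustration
# only; the real values come from the paper's 3-seed protocol.
gqa_sensitivity = [0.58, 0.61, 0.55]        # grouped-query attention model
mha_sensitivity = [1.1e-4, 0.9e-4, 1.0e-4]  # multi-head attention model

# Ratio of means, not mean of ratios: averaging the denominator across
# seeds first damps per-seed noise in the small MHA values.
ratio = mean(gqa_sensitivity) / mean(mha_sensitivity)
print(f"GQA/MHA sensitivity ratio: {ratio:.0f}x")
```

A mean-of-ratios would instead let a single small-denominator seed dominate the estimate, which is why the ratio-of-means form is the more stable summary here.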

These are empirical patterns, not causal claims.
The paper treats training > architecture > scale as a descriptive
hierarchy, not a mechanistic explanation.

To probe whether low specialization reflects suppression or optimization,
we introduce a simple perturbation-based diagnostic that distinguishes
pathological from healthy low-SI states by their response to injected noise.
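
The diagnostic described above can be sketched as follows. This is a toy illustration, not the paper's implementation: the SI proxy here (one minus mean pairwise cosine similarity between heads' attention distributions), the noise scale `sigma`, and all function names are my own assumptions for the sake of a runnable example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def specialization_index(attn):
    """Toy SI: 1 - mean pairwise cosine similarity between heads'
    attention distributions. Higher = heads attend to more distinct
    positions. (Illustrative proxy, not the paper's exact metric.)"""
    h = attn.reshape(attn.shape[0], -1)
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    sim = h @ h.T
    off_diag = sim[~np.eye(len(h), dtype=bool)]
    return 1.0 - off_diag.mean()

def noise_response(logits, sigma=0.1, n_trials=10, seed=0):
    """Perturbation diagnostic: inject Gaussian noise into attention
    logits and report the mean absolute change in SI. The idea is that
    a suppressed (pathological) low-SI state and an optimized (healthy)
    one respond differently to the same perturbation."""
    rng = np.random.default_rng(seed)
    base = specialization_index(softmax(logits))
    deltas = []
    for _ in range(n_trials):
        noisy = logits + rng.normal(0.0, sigma, logits.shape)
        deltas.append(abs(specialization_index(softmax(noisy)) - base))
    return base, float(np.mean(deltas))

# Example: 8 heads over a 16x16 attention-logit grid.
rng = np.random.default_rng(42)
logits = rng.normal(size=(8, 16, 16))
base_si, delta_si = noise_response(logits)
print(f"baseline SI = {base_si:.3f}, mean |dSI| under noise = {delta_si:.4f}")
```

Comparing `delta_si` across models at matched `base_si` is the kind of readout such a diagnostic would produce; the actual separation criterion used in the paper may differ.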

I’m particularly interested in feedback on:

  • Alternative explanations for the observed SWA / synthetic-training effects
  • Failure modes or confounders I might have missed
  • Comparable diagnostics others have used for attention internals
  • Whether SI is a reasonable proxy for attention diversity at scale

Paper (Zenodo, CC-BY):

Code & data (MIT):

Happy to clarify details or share specific plots if useful.
