Hi all,
I’m sharing results from a systematic empirical analysis of how
preference optimization (RLHF, DPO) affects attention head
specialization across different LLM families.
This is not a single-model case study: the analysis covers 25+ models
from 8 vendor families, uses a standardized protocol (bfloat16,
3-seed cross-validation), and includes explicit falsification checks.
Empirical result at a glance:
At matched scale, Grouped Query Attention (GQA) exhibits orders-of-magnitude
(~5,800×) higher sensitivity to random attention noise than Multi-Head Attention (MHA).
What we consistently observe:
- Sliding Window Attention (e.g. Mistral, Gemma-2) preserves or increases
  attention specialization under alignment, while comparable non-SWA
  models show large collapses in specialization index (SI).
- Synthetic-data training (Phi family) yields near scale-invariant
  specialization (SI ≈ 0.33) across a ~10× parameter range.
- Grouped Query Attention shows ~5,800× higher sensitivity to random
  attention noise than Multi-Head Attention at matched scale
  (ratio-of-means across three seeds), yet greater resilience
  under structured recursive alignment pressure.
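For concreteness, the sensitivity comparison in the last bullet can be sketched as a ratio-of-means (not mean-of-ratios) over per-seed SI shifts. This is an illustrative reconstruction, not the paper's actual code, and the numbers below are made up:

```python
import numpy as np

def sensitivity_ratio(gqa_deltas, mha_deltas):
    """Ratio-of-means sensitivity comparison.

    Each input holds one |delta SI| per seed, measured after injecting
    the same random attention noise into a GQA and an MHA model at
    matched scale. Means are taken before dividing, so a single noisy
    seed in the denominator cannot blow up the ratio.
    """
    return np.mean(gqa_deltas) / np.mean(mha_deltas)

# Illustrative numbers only (three seeds each), not results from the paper:
gqa = np.array([0.29, 0.31, 0.30])   # GQA: large SI shifts under noise
mha = np.array([5e-5, 6e-5, 4e-5])   # MHA: tiny shifts at matched scale
print(f"~{sensitivity_ratio(gqa, mha):,.0f}x")
```

Ratio-of-means is worth flagging because mean-of-ratios over seeds would weight seeds with near-zero MHA shifts disproportionately.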
These are empirical patterns, not causal claims.
The paper treats training > architecture > scale as a descriptive
hierarchy, not a mechanistic explanation.
To probe whether low specialization reflects suppression or genuine optimization,
we introduce a simple perturbation-based diagnostic that distinguishes
pathological from healthy low-SI states by their noise response.
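As a sketch of the diagnostic's logic: perturb the attention maps with small random noise and measure how far SI moves. A healthy (optimized) low-SI state barely reacts; a pathological (suppressed) one reacts strongly. Both the SI definition and the helper names below are hypothetical stand-ins, since the paper's exact formulas aren't reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def specialization_index(attn):
    """One plausible SI: 1 minus mean pairwise cosine similarity of
    per-head attention maps (hypothetical; the paper's definition may
    differ). attn has shape (heads, queries, keys)."""
    flat = attn.reshape(attn.shape[0], -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T
    h = sim.shape[0]
    # Average similarity over the h*(h-1) off-diagonal pairs.
    off_diag = (sim.sum() - np.trace(sim)) / (h * (h - 1))
    return 1.0 - off_diag

def noise_response(attn, sigma=0.01):
    """Perturbation diagnostic: |SI(noisy) - SI(clean)|.
    Small response -> robust ("healthy") low SI;
    large response -> fragile ("pathological") low SI."""
    noisy = attn + sigma * rng.standard_normal(attn.shape)
    # Re-normalize rows so the perturbed maps stay attention-like.
    noisy = np.clip(noisy, 1e-9, None)
    noisy /= noisy.sum(axis=-1, keepdims=True)
    return abs(specialization_index(noisy) - specialization_index(attn))
```

In practice this would be averaged over seeds and inputs and compared against a calibrated threshold rather than read off a single perturbation.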
I’m particularly interested in feedback on:
- Alternative explanations for the observed SWA / synthetic-training effects
- Failure modes or confounders I might have missed
- Comparable diagnostics others have used for attention internals
- Whether SI is a reasonable proxy for attention diversity at scale
Paper (Zenodo, CC-BY):
Code & data (MIT):
Happy to clarify details or share specific plots if useful.
