Training heritage matters more than scale in transformer dynamics
Hi everyone,
I’ve been looking into why different transformer families behave so differently internally — even when they have similar size and architecture.
Short version:
Who trained the model often matters more than its size.
Key empirical pattern
Across 23+ language models from 7 labs, I measured layer-wise residual gain G (the factor by which residual-stream norms grow or shrink from one layer to the next). Three consistent patterns emerged:
- EleutherAI models (Pythia, GPT-NeoX) mostly dampen signal (G < 1)
- Meta and OpenAI models (OPT and LLaMA from Meta; GPT-2 from OpenAI) consistently expand signal (G > 1)
- This holds across model sizes — same depth, same heads, opposite behavior
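To make the measurement concrete: here is a minimal sketch of one way to compute layer-wise gain, assuming it is defined as the ratio of mean residual-stream norms between consecutive layers (the preprint's exact definition may differ). The hidden states here are synthetic stand-ins, not real model activations.

```python
import numpy as np

def layerwise_gain(hidden_states):
    """Per-layer residual gain: ratio of mean residual-stream norms
    between consecutive layers.
    hidden_states: list of (n_tokens, d_model) arrays, one per layer."""
    norms = [np.linalg.norm(h, axis=-1).mean() for h in hidden_states]
    return [norms[i + 1] / norms[i] for i in range(len(norms) - 1)]

# Synthetic example: a "dampening" model whose activations shrink ~5% per layer.
rng = np.random.default_rng(0)
hs = [rng.normal(size=(128, 512)) * (0.95 ** layer) for layer in range(12)]
gains = layerwise_gain(hs)
print(all(g < 1 for g in gains))  # a dampener: every per-layer gain below 1
```

With real models, the per-layer hidden states would come from a forward pass (e.g. `output_hidden_states=True` in Hugging Face Transformers) instead of the synthetic arrays above.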
In practice:
Training heritage > geometry > scale
A depth constraint (“Kleiber-like law”)
For the Pythia family, the maximum stable residual gain scales with depth as:
G_max ≈ 10^(1/L), where L is the network depth (number of layers)
Deeper models are forced toward thermodynamic neutrality.
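One implication of this formula, if it holds: compounding the maximum gain over all L layers gives a fixed total expansion of (10^(1/L))^L = 10, independent of depth, so per-layer gain must shrink toward 1 as L grows. A quick numerical check:

```python
# Claimed depth constraint: G_max ≈ 10^(1/L) for depth L.
# Per-layer gain approaches 1 as depth grows, but the compounded
# total gain over the full stack stays fixed at 10.
for L in (6, 12, 24, 48, 96):
    g_max = 10 ** (1 / L)
    print(f"L={L:3d}  G_max={g_max:.4f}  total={g_max ** L:.1f}")
```

For example, L=12 gives G_max ≈ 1.21, while L=96 gives G_max ≈ 1.02, already close to neutral.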
This is not just curve fitting — the same constraint shows up in weight geometry.
Mechanistic signal
The ratio
||W_V|| / ||W_O||
predicts whether a model dampens or expands, with ~10× differences between labs.
This links microscopic weight structure to macroscopic dynamics.
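For readers who want to probe this themselves, here is a sketch of how one might compute the ratio per attention layer, assuming Frobenius norms of the value and output projection matrices (the preprint may use a different norm or aggregation). The weights below are random stand-ins with a built-in 3x norm imbalance.

```python
import numpy as np

def vo_ratio(w_v, w_o):
    """Frobenius-norm ratio of the value projection to the output projection."""
    return np.linalg.norm(w_v) / np.linalg.norm(w_o)

# Synthetic weights: a layer whose value projection is 3x larger in norm.
rng = np.random.default_rng(1)
w_v = 3.0 * rng.normal(size=(512, 512))
w_o = rng.normal(size=(512, 512))
print(round(vo_ratio(w_v, w_o), 1))  # ≈ 3.0
```

In a real model you would pull `w_v` and `w_o` from each attention block's state dict and compare the per-layer ratios across model families.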
Why this might matter
- Model selection: the training lab can matter more than parameter count
- Fine-tuning: RLHF changes magnitude, but cannot flip the sign (a dampener stays a dampener)
- Interpretability: suggests attention behaves like constrained information transport, not free mixing
Full preprint (Zenodo, DOI)
Thermodynamic Constraints in Transformer Architectures
Everything is reproducible; code and notebooks are linked there (GitHub repo)
Context
Earlier work explored:
- why embeddings cluster differently (uniformity asymmetry)
- how layer-wise dynamics change sign across depth (phase-structured dynamics)
This paper stands on its own — the above just explains where the questions came from.
Open questions (feedback welcome)
- Has anyone seen similar lab-dependent effects in other internal metrics?
- Are there controlled studies (same data, different optimizers) that could test this?
- Do these patterns appear in vision or multimodal transformers?
If this holds up under scrutiny, I’m planning to post it on arXiv next (endorsement code: TFTB6N) — feedback from experienced folks would be very welcome.
— Davide
Independent researcher
