Training heritage can matter more than scale in transformer dynamics

Hi everyone,

I’ve been looking into why different transformer families behave so differently internally, even when their sizes and architectures are similar.

Short version:
Who trained the model often matters more than its size.


Key empirical pattern

Across 23+ language models from 7 labs, I measured layer-wise residual gain G (the ratio of representation norms across consecutive layers, i.e., how much representations expand or contract through depth). Three consistent patterns emerged:

  • EleutherAI models (Pythia, GPT-NeoX) mostly dampen signal (G < 1)
  • Meta / OpenAI models (OPT, LLaMA, GPT-2) consistently expand signal (G > 1)
  • This holds across model sizes — same depth, same heads, opposite behavior

In practice:

Training heritage > geometry > scale
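To make “residual gain” concrete, here is a minimal sketch of one way to measure it. This is my simplified definition for illustration, not necessarily the paper’s exact metric: the per-layer gain is the ratio of hidden-state norms before and after each block, averaged over tokens.

```python
import torch

def layerwise_residual_gain(hidden_states):
    """Per-layer gain G_l = ||h_{l+1}|| / ||h_l||, averaged over tokens.

    hidden_states: list of [seq_len, d_model] tensors, one per layer
    boundary (e.g. the hidden-states output of a typical LM codebase).
    """
    gains = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        # Per-token norm ratio between consecutive layers, then average.
        ratio = h_out.norm(dim=-1) / h_in.norm(dim=-1)
        gains.append(ratio.mean().item())
    return gains

# Toy data standing in for a 4-layer model's hidden states, built so
# that each layer expands representations by roughly 1.1x:
torch.manual_seed(0)
states = [torch.randn(8, 64) * (1.1 ** l) for l in range(5)]
gains = layerwise_residual_gain(states)
print(gains)
```

A model with G > 1 at most layers would count as an “expander” in the taxonomy above; G < 1 as a “dampener”.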


A depth constraint (“Kleiber-like law”)

For the Pythia family, the maximum stable residual gain scales with depth as:

G_max ≈ 10^(1/L)

Deeper models are forced toward thermodynamic neutrality.
This is not just curve fitting — the same constraint shows up in weight geometry.
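To get a feel for the constraint, these are the gain ceilings G_max = 10^(1/L) implied at a few common depths (pure arithmetic, no model needed):

```python
# Implied per-layer gain ceiling G_max = 10^(1/L) at a few depths.
# As L grows, the ceiling collapses toward 1 (neutrality).
for L in [6, 12, 24, 48, 96]:
    print(f"L={L:2d}  G_max ≈ {10 ** (1 / L):.3f}")
# L= 6  G_max ≈ 1.468
# L=12  G_max ≈ 1.212
# L=24  G_max ≈ 1.101
# L=48  G_max ≈ 1.049
# L=96  G_max ≈ 1.024
```

So under this law a 96-layer model can expand by at most ~2.4% per layer before the cumulative gain across the stack exceeds a factor of 10.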


Mechanistic signal

The ratio

||W_V|| / ||W_O||

predicts whether a model dampens or expands, with ~10× differences between labs.
This links microscopic weight structure to macroscopic dynamics.
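For the ratio itself, here is a sketch of one way to measure it, using Frobenius norms. The attribute names for the value and output projections differ across codebases, so this uses a plain dict as a hypothetical stand-in for one attention layer’s weights:

```python
import torch

def vo_ratio(attn_layer):
    """Frobenius-norm ratio ||W_V|| / ||W_O|| for one attention layer.

    attn_layer: dict with "W_V" (value projection) and "W_O" (output
    projection) weight matrices; real implementations expose these
    under varying attribute names.
    """
    w_v = torch.linalg.matrix_norm(attn_layer["W_V"])  # Frobenius by default
    w_o = torch.linalg.matrix_norm(attn_layer["W_O"])
    return (w_v / w_o).item()

# Toy layer: a "dampener" would show W_V small relative to W_O.
torch.manual_seed(0)
layer = {"W_V": 0.3 * torch.randn(64, 64), "W_O": torch.randn(64, 64)}
print(f"||W_V||/||W_O|| ≈ {vo_ratio(layer):.2f}")
```

Averaging this ratio across layers would give a single per-model number to compare against the measured gain G.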


Why this might matter

  • Model selection: the training lab can matter more than parameter count
  • Fine-tuning: RLHF changes the magnitude but cannot flip the sign (a dampener stays a dampener)
  • Interpretability: suggests attention behaves like constrained information transport, not free mixing

Full preprint (Zenodo, DOI)

Thermodynamic Constraints in Transformer Architectures

Everything is reproducible; code and notebooks are linked there (GitHub repo).


Context

Earlier work explored:

  • why embeddings cluster differently (uniformity asymmetry)
  • how layer-wise dynamics change sign across depth (phase-structured dynamics)

This paper stands on its own — the above just explains where the questions came from.


Open questions (feedback welcome)

  • Has anyone seen similar lab-dependent effects in other internal metrics?
  • Are there controlled studies (same data, different optimizers) that could test this?
  • Do these patterns appear in vision or multimodal transformers?

If this holds up under scrutiny, I’m planning to post it on arXiv next (endorsement code: TFTB6N) — feedback from experienced folks would be very welcome.

Davide
Independent researcher
