Found consistent internal activation patterns preceding hallucinations in GPT-2 — Relation Dropout and Last-Layer Suppression

I’m an independent researcher based in Dublin. Over the past
week I’ve been investigating whether LLM hallucinations are
preceded by consistent, detectable patterns in the model’s
internal activations, before the wrong token is generated.

I found two things I haven’t seen named before:

1. Relation Dropout (small models)
When a small transformer hallucinates on factual recall,
attention to the semantic relation token (e.g. “capital”)
collapses in the final block, even when the entity token
(e.g. “germany”) is strongly attended to. 4 of 6 hallucination
cases showed this pattern clearly.

2. Last-Layer Suppression (GPT-2)
GPT-2 actually finds the correct answer in blocks 10-11 of
its residual stream (Paris peaks at 18%, Berlin at 35%,
Tokyo at 46%), and then block 12 suppresses it. This happened
on every one of 20,000 prompts, so the average suppression
layer is exactly 12.0.
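
To make the two patterns concrete, here is a minimal sketch in plain Python of how each one could be flagged once the activations have been extracted. It is not HallScan's API: the function names, the thresholds (`low`, `high`, `min_drop`), and all the numbers in the example are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def shows_relation_dropout(
    attn: Sequence[float],
    relation_idx: int,
    entity_idx: int,
    low: float = 0.05,   # assumed cutoff for "collapsed" attention
    high: float = 0.30,  # assumed cutoff for "strongly attended"
) -> bool:
    """Flag the Relation Dropout pattern.

    `attn` holds final-block attention weights from the last position
    to each input token (for example, averaged over heads).
    """
    return attn[relation_idx] < low and attn[entity_idx] > high


def suppression_block(
    probs: Sequence[float], min_drop: float = 0.5
) -> Optional[int]:
    """Return the 1-indexed block where the correct token is suppressed.

    `probs[i]` is the correct token's probability after block i + 1.
    "Suppressed" is assumed to mean: the first block after the peak
    where the probability falls below (1 - min_drop) * peak.
    Returns None if no such block exists.
    """
    peak_pos = max(range(len(probs)), key=lambda i: probs[i])
    peak = probs[peak_pos]
    for i in range(peak_pos + 1, len(probs)):
        if probs[i] < peak * (1.0 - min_drop):
            return i + 1
    return None


# Toy attention over ["the", "capital", "of", "germany", "is"] (made up).
attn = [0.10, 0.02, 0.08, 0.70, 0.10]
dropout = shows_relation_dropout(attn, relation_idx=1, entity_idx=3)

# Toy per-block probabilities for a 12-block model (made up).
probs = [0.00, 0.00, 0.00, 0.01, 0.01, 0.02,
         0.03, 0.05, 0.08, 0.15, 0.18, 0.01]
block = suppression_block(probs)

print(dropout, block)
```

With real data, `attn` and `probs` would come from the model's attention maps and from projecting each block's residual stream onto the vocabulary; the definitions of "collapsed" and "suppressed" used here are assumptions and may differ from the ones in the paper.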

What I built:

  • Trained an 806K-parameter transformer from scratch
  • Ran 20,000 prompts through GPT-2 on an RTX 4060 in 7 minutes
  • Built a 3-type hallucination taxonomy
  • Released HallScan: pip install hallscan
  • Published HallBench: 20,000 labeled examples on HuggingFace
  • Wrote a 6-page LaTeX paper

Links:
GitHub: TrazeMaG/hallucination-fingerprints (Identifying consistent internal activation patterns that precede hallucinations in transformer language models)
Dataset: Trazemag/hallbench (Hugging Face Datasets)
Tool: pip install hallscan

The ask:
I’m trying to submit to arXiv cs.CL but need an endorsement
as a first-time submitter. If anyone here is a qualified
cs.CL endorser and finds the work interesting, the
endorsement link is:

Happy to share the paper draft with anyone interested.