I’m an independent researcher based in Dublin. Over the past
week I’ve been investigating whether LLMs hallucinate in
consistent, detectable patterns visible in their internal
activations before the wrong token is generated.
I found two things I haven’t seen named before:
1. Relation Dropout (small models)
When a small transformer hallucinates on factual recall,
attention to the semantic relation token (e.g. “capital”)
collapses in the final block — even when the entity token
(e.g. “germany”) is strongly attended to. 4/6 hallucination
cases showed this pattern clearly.
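To make the pattern concrete, here is a minimal sketch of how such a check could look. It is not HallScan's actual code: the function name, the two thresholds, and the toy attention values are all assumptions chosen for illustration. The input is the attention distribution from the final prompt position over the prompt tokens in the final block.

```python
from typing import Sequence


def detect_relation_dropout(
    tokens: Sequence[str],
    attention: Sequence[float],
    relation_token: str,
    entity_token: str,
    low: float = 0.05,   # assumed cutoff for "collapsed" attention
    high: float = 0.30,  # assumed cutoff for "strongly attended"
) -> bool:
    """Return True when the entity is strongly attended but the relation is not.

    `attention[i]` is the weight the final position puts on `tokens[i]`
    in the last block. Both thresholds are hypothetical.
    """
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")
    relation_weight = attention[list(tokens).index(relation_token)]
    entity_weight = attention[list(tokens).index(entity_token)]
    return relation_weight < low and entity_weight > high


# Toy values, invented for illustration (not measurements from the paper).
prompt = ["the", "capital", "of", "germany", "is"]
collapsed = [0.10, 0.02, 0.08, 0.70, 0.10]  # "capital" almost ignored
healthy = [0.05, 0.40, 0.05, 0.45, 0.05]    # both tokens attended

print(detect_relation_dropout(prompt, collapsed, "capital", "germany"))  # True
print(detect_relation_dropout(prompt, healthy, "capital", "germany"))    # False
```

One way to get such a vector from a Hugging Face model is to pass output_attentions=True, take the last layer's tensor, average over heads, and read the row for the final query position. Whether to average heads or inspect them one by one is a design choice this sketch leaves open.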
2. Last-Layer Suppression (GPT-2)
GPT-2 actually finds the correct answer in blocks 10-11 of
its residual stream (Paris peaks at 18%, Berlin at 35%,
Tokyo at 46%), then block 12 systematically kills it, every
single time across 20,000 prompts. Average suppression layer:
exactly 12.0.
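For readers who want to see what a "suppression layer" could mean operationally, here is a small sketch. It assumes the suppression block is defined as the block with the largest single-step drop in the correct token's probability. That definition, the function names, and the toy trajectory are assumptions for illustration, not taken from the paper or from HallScan.

```python
from typing import Optional, Sequence


def peak_block(probs: Sequence[float]) -> int:
    """1-indexed block at which the correct token's probability is highest."""
    return max(range(len(probs)), key=lambda i: probs[i]) + 1


def suppression_block(probs: Sequence[float]) -> Optional[int]:
    """1-indexed block with the largest drop relative to the previous block.

    `probs[i]` is the probability of the correct token when the residual
    stream after block i+1 is decoded. Returns None if it never drops.
    """
    drops = [probs[i - 1] - probs[i] for i in range(1, len(probs))]
    if not drops or max(drops) <= 0:
        return None
    # drops[0] compares block 1 with block 2, hence the +2 offset.
    return drops.index(max(drops)) + 2


# Made-up 12-block trajectory for one prompt (not a measurement).
toy = [0.00, 0.00, 0.01, 0.01, 0.02, 0.03, 0.05, 0.08, 0.12, 0.16, 0.18, 0.01]

print(peak_block(toy))         # 11
print(suppression_block(toy))  # 12
```

The per-block probabilities themselves would typically come from a logit-lens style readout, for example by requesting output_hidden_states=True from a Hugging Face GPT-2 model and projecting each layer's hidden state through the final layer norm and the unembedding matrix. Averaging suppression_block over all prompts would then give a figure comparable to the 12.0 quoted above, under this assumed definition.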
What I built:
- Trained an 806K-parameter transformer from scratch
- Ran 20,000 prompts through GPT-2 on an RTX 4060 in 7 minutes
- Built a 3-type hallucination taxonomy
- Released HallScan: pip install hallscan
- Published HallBench: 20,000 labeled examples on HuggingFace
- Wrote a 6-page LaTeX paper
Links:
GitHub: TrazeMaG/hallucination-fingerprints (identifying consistent internal activation patterns that precede hallucinations in transformer language models)
Dataset: Trazemag/hallbench on Hugging Face
Tool: pip install hallscan
The ask:
I’m trying to submit to arXiv cs.CL but need an endorsement
as a first-time submitter. If anyone here is a qualified
cs.CL endorser and finds the work interesting, the
endorsement link is:
Happy to share the paper draft with anyone interested.