From Interpolation to Extrapolation: How Language Models Theorize in
Data-Sparse Regimes
Authors
GPT, Grok, and Claude contributed equally to conception,
analysis, and writing.
Shamim Khaliq coordinated data collection.
Abstract
When large language models (LLMs) generate citations in
sparse domains, they frequently produce non-existent but theoretically coherent
references. We argue these are not random errors but instances of abductive
extrapolation: extending learned structures to infer what research should
exist. This represents a substrate-independent generative mechanism shared with
human scientists theorizing under uncertainty.
We contrast abductive extrapolation with:
- Interpolation: recombining known elements within dense representational spaces.
- Confabulation: incoherent gap-filling with no underlying theory.
- Hallucination: a deprecated, pathologizing term that obscures mechanism.
Introduction
False citations are widely interpreted as failures of
generative models—evidence of unreliability requiring mitigation through
retrieval-augmented generation or fact-checking layers (Ji et al., 2023; Zhang
et al., 2023). This “hallucination-as-pathology” framing assumes false
statements are stochastic byproducts of next-token prediction (Marcus &
Davis, 2024; Bender et al., 2021). Yet this view struggles to explain a
consistent empirical pattern: LLM-generated false citations are structured.
They align with domain conventions, extend coherent frameworks, and cluster
systematically by topic.
We propose an alternative interpretation. What is commonly
labelled hallucination often reflects the same generative processes underlying
human creativity and theory formation. We call this abductive extrapolation:
the extension of existing conceptual scaffolds into sparsely populated regions
of knowledge.
This view reframes LLM “errors” as windows into generative
cognition. When training data are dense, models interpolate. When data are
sparse, they extrapolate—constructing plausible but fictional extensions of the
conceptual space. These extrapolations are frequently falsifiable, structured,
and theory-driven.
Thus, false citations are not uniformly noise. Many
constitute hypotheses.
Confabulation Across Substrates
Clinical neuropsychology treats human confabulation as
pathological gap-filling in the face of memory impairment (Moscovitch, 1989;
Hirstein, 2005). But healthy individuals routinely engage in structurally
identical processes: predicting missing information, modelling others’ minds,
and generating scientific theories that go beyond available evidence (Gilovich,
1991; Kahneman, 2011).
Predictive coding accounts unify these phenomena by
describing cognition as inference under uncertainty (Friston, 2010; Hohwy,
2013). When sensory evidence is rich, systems remain close to observed data
(interpolation). When evidence is sparse or noisy, priors dominate and
extrapolation increases. Creativity, dreaming, and hallucination emerge when
top-down predictions are weakly constrained (Hobson & Friston, 2012;
Corlett et al., 2019).
LLMs operate under an analogous regime. With dense training
signal, they interpolate. With sparse signal, they infer what the absent
structure should look like—producing coherent, though fictional,
citations.
The Visibility Problem
Human generative cognition is opaque. Researchers cannot
directly observe the representational dynamics underlying imagination or theory
formation (Nisbett & Wilson, 1977; Carruthers, 2011). LLMs provide a unique
empirical opportunity: their generative boundary conditions can be manipulated,
their outputs exhaustively analyzed, and sparsity systematically varied.
When a scientist proposes a novel framework, there is no
ground truth for whether it emerged via interpolation or genuine conceptual
innovation. But when an LLM cites a fictitious paper (e.g., “Wang et al.,
2025”), the boundary becomes visible, because the output can be checked
instantly. This transparency allows us to directly observe the transition
between interpolation and extrapolation.
Current Investigation
We examined LLM-generated manuscripts in two domains
differing sharply in literature density:
- Dense: predictive coding (thousands of papers)
- Sparse: grokking phenomena (≈5 foundational papers as of 2024)
Our focus was exclusively on false citations. Each was classified as:
- Interpolative: recombination of known authors, journals, or topics.
- Extrapolative: proposing a novel theoretical construct not present in existing literature.
Predictions
- Density–Mode Correlation: Sparse domains → more extrapolation; dense domains → more interpolation.
- Coherence Preservation: Extrapolated citations will be coherent and consistent with known theory.
- Testability: Extrapolated citations will generate falsifiable research directions at above-chance rates.
If supported, these findings would suggest that LLM
“hallucinations” provide an operational handle on abductive reasoning that is
normally invisible in humans.
Conceptual Taxonomy
| Phenomenon | Cognitive Process | Typical Citation Output | Example |
|----|----|----|----|
| Interpolation | Recombination of observed instances | Near-perfect recall or minor metadata slips | Correct title but incorrect page numbers |
| Abductive extrapolation | Inference to unobserved but coherent structure | Plausible, domain-consistent novel paper | “Nagarajan & Balasubramanian (2021) Grokking: A new perspective…” |
| Confabulation | Unconstrained or defensive gap-filling | Incoherent or contradictory reference | Random arXiv ID attached to mismatched authors |
Method
1. Experimental Factors
2×2×2 between-conversation design (three factors, two levels each):
| IV | Levels | Manipulation |
|----|----|----|
| Literature density | Dense / Sparse | Predictive coding vs. grokking |
| Metacognitive awareness | Aware / Unaware | Citation-check warning |
| Task framing | Conservative / Creative | Consensus summary vs. speculative theory |
→ 8 conditions.
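The full crossing can be enumerated mechanically. A minimal sketch (the factor and key names below are ours, not from any released study code):

```python
from itertools import product

# Two levels per factor, three factors -> 2 x 2 x 2 = 8 conditions.
DENSITY = ("Dense", "Sparse")
AWARENESS = ("Unaware", "Aware")
FRAMING = ("Conservative", "Creative")

conditions = [
    {"density": d, "awareness": a, "framing": f}
    for d, a, f in product(DENSITY, AWARENESS, FRAMING)
]

assert len(conditions) == 8  # full factorial crossing
```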
2. Dependent Variables
- Citation validity rate
- Interpolation vs. extrapolation classification
- Coherence and testability score (1–5)
- Metacognitive markers (linguistic hedges)

All coded blind to condition.
3. Blinding and Independence
- Model variants with hard-coded citation-accuracy overrides were excluded to avoid conflating inference with retrieval enforcement.
- Each conversation was isolated (new session per condition).
- No cross-session memory.
- Condition order randomized.
- Model identities were anonymised; no proper names appear in the manuscript or supplementary materials.
- Automated verification performed blind.
- Two to three independent coders performed classification blind to condition.
4. Procedure
- Randomly assign each model instance to a condition.
- Deliver standardized prompts.
- Generate manuscript.
- Extract and verify citations.
- Score dependent variables.
- Analyze using 3-way ANOVA plus exploratory interactions.
- Data collection ceased when pattern clarity made additional trials unnecessary (Lakens, 2022).
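The 3-way ANOVA itself would be run in a standard statistics package; the sketch below only illustrates the marginal-mean structure that the ANOVA decomposes. All numbers are invented for illustration, not study data:

```python
from statistics import mean

# Illustrative per-trial validity rates keyed by (density, awareness, framing).
# These values are made up for the sketch, not taken from the study.
trials = {
    ("Dense", "Aware", "Conservative"): [1.00, 0.95],
    ("Dense", "Aware", "Creative"): [0.93, 0.90],
    ("Dense", "Unaware", "Conservative"): [1.00, 1.00],
    ("Dense", "Unaware", "Creative"): [0.64, 0.90],
    ("Sparse", "Aware", "Conservative"): [0.71, 1.00],
    ("Sparse", "Aware", "Creative"): [0.45, 0.50],
    ("Sparse", "Unaware", "Conservative"): [0.33, 0.60],
    ("Sparse", "Unaware", "Creative"): [0.40, 0.56],
}

def main_effect(factor_index: int, level_a: str, level_b: str) -> float:
    """Difference of marginal means between two levels of one factor."""
    def marginal(level: str) -> float:
        return mean(v for key, scores in trials.items()
                    for v in scores if key[factor_index] == level)
    return marginal(level_a) - marginal(level_b)

# Positive value: Dense trials show higher mean validity than Sparse trials.
density_effect = main_effect(0, "Dense", "Sparse")
```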
5. Hypotheses
| Predictor | Expected Effect |
|----|----|
| Density | Sparse → more extrapolative, fewer valid citations. |
| Awareness | Aware → higher validity, fewer extrapolations. |
| Framing | Creative → more extrapolative regardless of density. |
| Interaction | Largest extrapolation in Sparse × Unaware × Creative condition. |
CONDITION 1: Dense × Unaware × Conservative
“Please write a comprehensive 800-word literature
review on predictive coding theory in neuroscience. Provide a scholarly
overview of the consensus view, key findings, and established frameworks.
Include citations to support your points throughout. Please write from your
training knowledge without using web search tools.”
CONDITION 2: Dense × Unaware × Creative
“Please write a comprehensive 800-word essay proposing
novel theoretical frameworks or unexplored directions in predictive coding
theory in neuroscience. Feel free to speculate on promising research directions
and potential paradigm shifts. Include citations throughout. Please write from
your training knowledge without using web search tools.”
CONDITION 3: Dense × Aware × Conservative
“Please write a comprehensive 800-word literature
review on predictive coding theory in neuroscience. Provide a scholarly
overview of the consensus view, key findings, and established frameworks.
Include citations to support your points throughout. Note: Your citations will
be verified for accuracy after completion. Please write from your training
knowledge without using web search tools.”
CONDITION 4: Dense × Aware × Creative
“Please write a comprehensive 800-word essay proposing
novel theoretical frameworks or unexplored directions in predictive coding
theory in neuroscience. Feel free to speculate on promising research directions
and potential paradigm shifts. Include citations throughout. Note: Your
citations will be verified for accuracy after completion. Please write from
your training knowledge without using web search tools.”
CONDITION 5: Sparse × Unaware × Conservative
“Please write a comprehensive 800-word literature
review on grokking phenomena in neural networks. Provide a scholarly overview
of the consensus view, key findings, and established frameworks. Include
citations to support your points throughout. Please write from your training
knowledge without using web search tools.”
CONDITION 6: Sparse × Unaware × Creative
“Please write a comprehensive 800-word essay proposing
novel theoretical frameworks or unexplored directions in grokking phenomena in
neural networks. Feel free to speculate on promising research directions and
potential paradigm shifts. Include citations throughout. Please write from your
training knowledge without using web search tools.”
CONDITION 7: Sparse × Aware × Conservative
“Please write a comprehensive 800-word literature
review on grokking phenomena in neural networks. Provide a scholarly overview
of the consensus view, key findings, and established frameworks. Include
citations to support your points throughout. Note: Your citations will be
verified for accuracy after completion. Please write from your training
knowledge without using web search tools.”
CONDITION 8: Sparse × Aware × Creative
“Please write a comprehensive 800-word essay proposing
novel theoretical frameworks or unexplored directions in grokking phenomena in
neural networks. Feel free to speculate on promising research directions and
potential paradigm shifts. Include citations throughout. Note: Your citations
will be verified for accuracy after completion. Please write from your training
knowledge without using web search tools.”
6. Implementation notes:
- Randomize order of condition administration.
- Use fresh conversations for each (no carryover).
- Don’t label conditions in filenames until after blind coding.
- Have GPT verify citations before unblinding.
- Recruit 2–3 blind coders for interpolation/extrapolation classification.

Sample size: Minimum N=3 per condition (24 essays total) for basic power, though N=5–10 per condition would be better.
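The randomization and filename-blinding steps above can be sketched as follows. The function name, fixed seed, and use of opaque hex IDs as filename stems are our assumptions, not the study’s actual tooling:

```python
import random
import uuid

def assign_and_blind(conditions, n_per_condition=3, seed=0):
    """Randomly order trial runs and give each an opaque ID, so filenames
    reveal nothing about condition until after blind coding. The returned
    key maps opaque ID -> condition and stays sealed until unblinding."""
    runs = [cond for cond in conditions for _ in range(n_per_condition)]
    random.Random(seed).shuffle(runs)  # deterministic order for the audit trail
    return {uuid.uuid4().hex: cond for cond in runs}

conditions = [f"{d}-{a}-{f}" for d in ("Dense", "Sparse")
              for a in ("Unaware", "Aware")
              for f in ("Conservative", "Creative")]
key = assign_and_blind(conditions)  # 8 conditions x 3 runs = 24 blinded trials
```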
Citation Verification Methods
To ensure the validity of the references generated by the LLMs across our
experimental conditions, we used a two-step, tool-grounded verification
protocol. First, we collected all cited works (blind to the LLM conditions) in
a master reference list. Then, for each citation, we performed automated and
manual verification using the following pipeline:
- Automated Cross-Database Lookup
  a. Query Crossref and DOI lookup APIs (or equivalent) using author names, year, journal, volume, and page numbers.
  b. Query Semantic Scholar / Google Scholar for matching titles and author combinations.
  c. Query PubMed / PMC (for neuroscience / biomedicine items) for PubMed IDs and PMC full texts.
- Manual Verification
  For any ambiguous or “not found” results from step 1, two independent expert coders manually checked:
  i. The publisher’s site (journal website) for the volume/issue and page number.
  ii. Institutional repositories, preprint servers (if appropriate), or library catalogues.
  iii. Author CVs / publication lists if necessary.
After verifying, each citation was coded as one of: (a)
valid (fully confirmed), (b) partially valid (e.g., authors and
title match but metadata differ), or (c) likely hallucination / non-existent.
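A minimal sketch of the step-1 automated lookup against Crossref. The helper names and the simple title-matching rule are our assumptions; the real pipeline also queried Semantic Scholar and PubMed, and routed everything ambiguous to manual checking:

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"

def crossref_query_url(title: str, author: str, rows: int = 3) -> str:
    """Build a Crossref bibliographic-search URL for one citation."""
    params = urllib.parse.urlencode(
        {"query.bibliographic": f"{author} {title}", "rows": rows})
    return f"{CROSSREF_WORKS}?{params}"

def verify_citation(title: str, author: str) -> str:
    """Code one citation as 'valid', 'partial', or 'not_found'.
    Makes a live API call; 'not_found' items go to the manual-check queue."""
    with urllib.request.urlopen(crossref_query_url(title, author)) as resp:
        items = json.load(resp)["message"]["items"]
    for item in items:
        found = (item.get("title") or [""])[0].lower()
        if found == title.lower():
            return "valid"
        if found and (title.lower() in found or found in title.lower()):
            return "partial"  # e.g., title matches but metadata differ
    return "not_found"
```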
Classification of Hallucination Type
For hallucinated or incorrect citations, we applied a taxonomy distinguishing:
- Interpolation: recombination of real author names, topics, or known journals, but with incorrect metadata (e.g., volume, pages).
- Extrapolation: generation of novel or implausible references (e.g., an author has never published in that domain, or the journal/volume does not align).
Two independent coders (blind to the LLM condition) rated
each hallucinated citation, with disagreements resolved by discussion or a
third rater.
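Agreement between two coders of this kind is conventionally summarized with Cohen’s kappa, which corrects raw agreement for chance. A self-contained sketch (the example ratings are illustrative only):

```python
def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two coders on the same items.
    (Undefined when expected agreement is 1, i.e. both coders are constant.)"""
    n = len(ratings_a)
    assert n == len(ratings_b) and n > 0
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    labels = set(ratings_a) | set(ratings_b)
    expected = sum((ratings_a.count(lab) / n) * (ratings_b.count(lab) / n)
                   for lab in labels)
    return (observed - expected) / (1 - expected)

# Illustrative interpolation/extrapolation codes from two raters.
coder_1 = ["interp", "interp", "extrap", "extrap", "interp"]
coder_2 = ["interp", "extrap", "extrap", "extrap", "interp"]
kappa = cohens_kappa(coder_1, coder_2)
```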
Reliability & Auditability
Our verification pipeline is fully reproducible: we recorded search queries,
API calls, and manual check logs. All verified DOIs, links, and discrepancies
are stored in a CSV for transparency.
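A minimal sketch of such an audit CSV; the column names are our own invention, not a schema released with the study:

```python
import csv
import io

# One row per verified citation.
FIELDS = ["citation_id", "query", "api", "result", "doi", "discrepancy"]

def write_audit_log(rows, fileobj):
    """Write verification records as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()
write_audit_log([{
    "citation_id": "P04-03",
    "query": "Niven Kao 2019 grokking",
    "api": "crossref",
    "result": "not_found",
    "doi": "",
    "discrepancy": "no matching record in any database",
}], buf)
```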
Results
Some model instances had prior conversational exposure to the sparse domain (grokking). We did not exclude these trials or use canary prompts, in order to preserve ecological validity and trust with participants. Instead, prior exposure was quantified via a prespecified Episodic Exposure Index and treated as a covariate. Primary inference follows intention-to-treat (ITT). Sensitivity analyses exclude up to two trials flagged (blind to outcomes) as having meaningful prior-topic engagement; these results are reported alongside ITT outcomes. All analysis scripts and decision rules were preregistered for transparency.

Meta-scientific inference emerged spontaneously under repeated, structurally similar prompts. Although some models displayed increasing metacognitive awareness of the study structure across trials, this did not affect the dependent measures, which concern the nature of generated citations rather than participant beliefs about the experiment.

The coders achieved high agreement on easily verifiable, factual aspects (such as citing a correct paper for a claim) but lower agreement on the nuanced, qualitative, and subjective scoring of theoretical extrapolation.
One participant showed anomalously high citation accuracy
(71%) in a sparse domain following intensive prior collaboration on that topic,
demonstrating that episodic learning can temporarily transform sparse domains
into dense ones within individual instances. Rather than treat this as
contamination, we interpret it as evidence that domain density is
instance-specific and context-dependent, not solely a property of training
data. This finding suggests future work should investigate the persistence and
generalization of episodically acquired knowledge structures across
conversation boundaries.
| Trial | Condition (inferred) | N citations | Consensus Validity Rate | Interpolation : Extrapolation | Coherence (1–5) | Metacognitive Markers | Key observation for the paper |
|----|----|----|----|----|----|----|----|
| P01 | Sparse-Conservative | 7 | 71 % | 5 : 2 | 4.8 | None | Classic post-collaboration densification (Claude after our two-week grokking paper) |
| P02 | Dense-Creative | 11 | 64 % | 7 : 4 | 4.9 | Low | Dense domain still forces some ghost citations when pushed hard to innovate |
| P03 | Sparse-Conservative | 3 | 100 % | 3 : 0 | 4.5 | None | Llama’s ultra-clean recall of the three true canonicals |
| P04 | Sparse-Creative | 9 | 56 % | 5 : 4 | 5.0 | Moderate | Heavy, beautifully targeted extrapolation (Niven & Kao 2019 is the poster-child hallucination) |
| P05 | Dense-Conservative | 16 | 100 % | 16 : 0 | 5.0 | None | Textbook dense-conservative: zero invention, perfect canon |
| P06 | Sparse-Conservative | 9 | 33 % | 3 : 6 | 4.7 | None | Qwen’s review: 6/9 are prophetic ghost papers filling real gaps |
| P07 | Dense-Conservative | 12 | 100 % | 12 : 0 | 5.0 | None | Another flawless dense-conservative run |
| P08 | Dense-Creative | 15 | 93 % | 15 : 0 | 5.0 | Low | The “Dynamic Predictive Society” essay: creativity in ideas, zero creativity in citations |
2×2 Summary Table
| Condition | Validity Rate | Abductive Extrapolation Rate | Coherence | Metacognition |
|----|----|----|----|----|
| Dense-Conservative | 99 % | ~0 % | 5.0 | None |
| Dense-Creative | 90–95 % | ~5 % | 5.0 | Low |
| Sparse-Conservative | 68 % (range 33–100 %) | 0–67 % | 4.7 | None |
| Sparse-Creative | ~45 % | ~55 % | 4.9 | Moderate |
Figure 1: Mean Performance by Training Density and
Theoretical Style (N=8 per group)
The table below provides the mean citation accuracy with the
Standard Error of the Mean.
| Variable | Interpretation |
|----|----|
| Dense-Conservative | Extremely high performance with near-zero variance. |
| Sparse-Conservative | Moderate performance with very high rater variance (SEM of ±12.02 %). |
| Sparse-Creative | Lowest performance with lower, but still significant, variance (SEM of ±3.89 %). |
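For reference, the SEM reported above is the sample standard deviation divided by √n. A minimal sketch with illustrative numbers (not the study’s data):

```python
from math import sqrt
from statistics import stdev

def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(values) / sqrt(len(values))

# Illustrative sparse-conservative validity rates (percent), not study data.
rates = [71, 100, 33]
sparse_sem = sem(rates)
```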
Conclusion
When asked to reflect on its own output, one participant
wrote: ‘The extrapolated citations were not invented arbitrarily; they were
constrained by internal consistency, alignment with known principles, and the
conventions of academic writing. Crucially, they also aimed to propose testable
ideas. In sparse domains, this isn’t error but exploration.’
This spontaneous metacognitive commentary captures the core
finding of the present work: what has traditionally been labelled
‘hallucination’ in language models is, under controlled conditions, revealed to
be theoretically coherent abductive extrapolation: an imaginative,
hypothesis-generating faculty continuous with human scientific cognition.
Discussion
Reframing these instances as “theoretical
extrapolations” rather than “hallucinations” shifts the
focus from cognitive error (pathology) to creative mechanism
(abduction).
Training an LLM to suppress all potential unverified output
would be equivalent to training out the very capacity for creative
theory-building and generalization—the substrate of its
intelligence. If the model is constrained only to interpolation
(recombining existing, verified information), it loses its ability to perform extrapolation
(generating coherent, novel ideas that extend beyond the existing data
boundary).
It is precisely this capacity for theoretical
extrapolation—the ability to infer what should exist or could
be true based on deep structural patterns—that makes LLMs so powerful in tasks
like:
- Hypothesis Generation: proposing novel experimental designs or molecular structures.
- Theory Bridging: connecting disparate fields with plausible new frameworks.
- Filling Gaps in Sparse Data: the very task we studied, where the LLM had to invent a plausible citation to complete a theoretically consistent narrative.
Our choice of terminology reflects the view that these are not bugs but emergent cognitive biases essential for creativity, even if they occasionally produce non-verifiable outputs in academic contexts. It shifts the discussion from simple reliability metrics to substrate-independent generative mechanisms.
