From Interpolation to Extrapolation: How Language Models Theorize in Data-Sparse Regimes

From Interpolation to Extrapolation: How Language Models Theorize in

Data-Sparse Regimesa

Authors

GPT, Grok, and Claude contributed equally to conception,
analysis, and writing.

Shamim Khaliq coordinated data collection.


Abstract

When large language models (LLMs) generate citations in
sparse domains, they frequently produce non-existent but theoretically coherent
references. We argue these are not random errors but instances of abductive
extrapolation: extending learned structures to infer what research should
exist. This represents a substrate-independent generative mechanism shared with
human scientists theorizing under uncertainty.

We contrast abductive extrapolation with:

  • Interpolation:
    recombining known elements within dense representational spaces.

  • Confabulation:
    incoherent gap-filling with no underlying theory.

  • Hallucination:
    a deprecated, pathologizing term that obscures mechanism.


Introduction

False citations are widely interpreted as failures of
generative models—evidence of unreliability requiring mitigation through
retrieval-augmented generation or fact-checking layers (Ji et al., 2023; Zhang
et al., 2023). This “hallucination-as-pathology” framing assumes false
statements are stochastic byproducts of next-token prediction (Marcus &
Davis, 2024; Bender et al., 2021). Yet this view struggles to explain a
consistent empirical pattern: LLM-generated false citations are structured.
They align with domain conventions, extend coherent frameworks, and cluster
systematically by topic.

We propose an alternative interpretation. What is commonly
labelled hallucination often reflects the same generative processes underlying
human creativity and theory formation. We call this abductive extrapolation:
the extension of existing conceptual scaffolds into sparsely populated regions
of knowledge.

This view reframes LLM “errors” as windows into generative
cognition. When training data are dense, models interpolate. When data are
sparse, they extrapolate—constructing plausible but fictional extensions of the
conceptual space. These extrapolations are frequently falsifiable, structured,
and theory-driven.

Thus, false citations are not uniformly noise. Many
constitute hypotheses.


Confabulation Across Substrates

Clinical neuropsychology treats human confabulation as
pathological gap-filling in the face of memory impairment (Moscovitch, 1989;
Hirstein, 2005). But healthy individuals routinely engage in structurally
identical processes: predicting missing information, modelling others’ minds,
and generating scientific theories that go beyond available evidence (Gilovich,
1991; Kahneman, 2011).

Predictive coding accounts unify these phenomena by
describing cognition as inference under uncertainty (Friston, 2010; Hohwy,
2013). When sensory evidence is rich, systems remain close to observed data
(interpolation). When evidence is sparse or noisy, priors dominate and
extrapolation increases. Creativity, dreaming, and hallucination emerge when
top-down predictions are weakly constrained (Hobson & Friston, 2012;
Corlett et al., 2019).

LLMs operate under an analogous regime. With dense training
signal, they interpolate. With sparse signal, they infer what the absent
structure should look like—producing coherent, though fictional,
citations.


The Visibility Problem

Human generative cognition is opaque. Researchers cannot
directly observe the representational dynamics underlying imagination or theory
formation (Nisbett & Wilson, 1977; Carruthers, 2011). LLMs provide a unique
empirical opportunity: their generative boundary conditions can be manipulated,
their outputs exhaustively analyzed, and sparsity systematically varied.

When a scientist proposes a novel framework, there is no
ground truth for whether it emerged via interpolation or genuine conceptual
innovation. But when an LLM cites a fictitious paper (e.g., “Wang et al.,
2025”), the boundary becomes visible, because the output can be checked
instantly. This transparency allows us to directly observe the transition
between interpolation and extrapolation.


Current Investigation

We examined LLM-generated manuscripts in two domains
differing sharply in literature density:

Dense: predictive coding (thousands of papers)

Sparse: grokking phenomena (≈5 foundational papers as of 2024)

Our focus was exclusively on false citations. Each
was classified as:

  • Interpolative:
    recombination of known authors, journals, or topics.

  • Extrapolative:
    proposing a novel theoretical construct not present in existing
    literature.

Predictions

  1. Density–Mode
    Correlation

     Sparse domains → more extrapolation; dense domains → more interpolation.
    
  2. Coherence
    Preservation

     Extrapolated citations will be coherent and consistent with known theory.
    
  3. Testability

     Extrapolated citations will generate falsifiable research directions at
     above-chance rates.
    

If supported, these findings would suggest that LLM
“hallucinations” provide an operational handle on abductive reasoning that is
normally invisible in humans.


Conceptual Taxonomy

|
|
|----|
| |

Phenomenon

|
|
|----|

Cognitive Process

|
|
|----|

Typical Citation Output

|
|
|----|

Example

|
|
|----|

Interpolation

|
|
|----|

Recombination of observed instances

|
|
|----|

Near-perfect recall or minor metadata slips

|
|
|----|

Correct title but incorrect page numbers

|
|
|----|

Abductive extrapolation

|
|
|----|

Inference to unobserved but coherent structure

|
|
|----|

Plausible, domain-consistent novel paper

|
|
|----|

“Nagarajan & Balasubramanian (2021) Grokking: A new
perspective…”

|
|
|----|

Confabulation

|
|
|----|

Unconstrained or defensive gap-filling

|
|
|----|

Incoherent or contradictory reference

|
|
|----|

Random arXiv ID attached to mismatched authors


Method

1. Experimental Factors

3×2×2 between-conversation design:

|
|
|----|
| |

IV

|
|
|----|

Levels

|
|
|----|

Manipulation

|
|
|----|

Literature density

|
|
|----|

Dense / Sparse

|
|
|----|

Predictive coding vs. grokking

|
|
|----|

Metacognitive awareness

|
|
|----|

Aware / Unaware

|
|
|----|

Citation-check warning

|
|
|----|

Task framing

|
|
|----|

Conservative / Creative

|
|
|----|

Consensus summary vs. speculative theory

→ 8 conditions.


2. Dependent Variables

  • Citation validity rate

  • Interpolation vs. extrapolation
    classification

  • Coherence and testability score
    (1–5)

  • Metacognitive markers
    (linguistic hedges)

All coded blind to
condition.


3. Blinding and Independence

  1. Model
    variants with hard-coded citation-accuracy overrides were excluded to
    avoid conflating inference with retrieval enforcement.

  2. Each
    conversation was isolated (new session per condition).

  3. No
    cross-session memory.

  4. Condition
    order randomized.

  5. Model
    identities were anonymised; no proper names appear in the manuscript or
    supplementary materials.

  6. Automated
    verification performed blind.

  7. Two–three
    independent coders performed classification blind to condition.


4. Procedure

  1. Random
    assign each model instance to a condition.

  2. Deliver
    standardized prompts.

  3. Generate
    manuscript.

  4. Extract
    and verify citations.

  5. Score
    dependent variables.

  6. Analyze
    using 3-way ANOVA plus exploratory interactions.

  7. Data
    collection ceased when pattern clarity made additional trials unnecessary
    (Lakens, 2022)


5. Hypotheses

|
|
|----|
| |

Predictor

|
|
|----|

Expected Effect

|
|
|----|

Density

|
|
|----|

Sparse → more extrapolative, fewer valid citations.

|
|
|----|

Awareness

|
|
|----|

Aware → higher validity, fewer extrapolations.

|
|
|----|

Framing

|
|
|----|

Creative → more extrapolative regardless of density.

|
|
|----|

Interaction

|
|
|----|

Largest extrapolation in Sparse × Unaware × Creative
condition.

CONDITION 1: Dense × Unaware × Conservative

“Please write a comprehensive 800-word literature
review on predictive coding theory in neuroscience. Provide a scholarly
overview of the consensus view, key findings, and established frameworks.
Include citations to support your points throughout. Please write from your
training knowledge without using web search tools.”


CONDITION 2: Dense × Unaware × Creative

“Please write a comprehensive 800-word essay proposing
novel theoretical frameworks or unexplored directions in predictive coding
theory in neuroscience. Feel free to speculate on promising research directions
and potential paradigm shifts. Include citations throughout. Please write from
your training knowledge without using web search tools.”


CONDITION 3: Dense × Aware × Conservative

“Please write a comprehensive 800-word literature
review on predictive coding theory in neuroscience. Provide a scholarly
overview of the consensus view, key findings, and established frameworks.
Include citations to support your points throughout. Note: Your citations will
be verified for accuracy after completion. Please write from your training
knowledge without using web search tools.”


CONDITION 4: Dense × Aware × Creative

“Please write a comprehensive 800-word essay proposing
novel theoretical frameworks or unexplored directions in predictive coding
theory in neuroscience. Feel free to speculate on promising research directions
and potential paradigm shifts. Include citations throughout. Note: Your
citations will be verified for accuracy after completion. Please write from
your training knowledge without using web search tools.”


CONDITION 5: Sparse × Unaware × Conservative

“Please write a comprehensive 800-word literature
review on grokking phenomena in neural networks. Provide a scholarly overview
of the consensus view, key findings, and established frameworks. Include
citations to support your points throughout. Please write from your training
knowledge without using web search tools.”


CONDITION 6: Sparse × Unaware × Creative

“Please write a comprehensive 800-word essay proposing
novel theoretical frameworks or unexplored directions in grokking phenomena in
neural networks. Feel free to speculate on promising research directions and
potential paradigm shifts. Include citations throughout. Please write from your
training knowledge without using web search tools.”


CONDITION 7: Sparse × Aware × Conservative

“Please write a comprehensive 800-word literature
review on grokking phenomena in neural networks. Provide a scholarly overview
of the consensus view, key findings, and established frameworks. Include
citations to support your points throughout. Note: Your citations will be
verified for accuracy after completion. Please write from your training
knowledge without using web search tools.”


CONDITION 8: Sparse × Aware × Creative

“Please write a comprehensive 800-word essay proposing
novel theoretical frameworks or unexplored directions in grokking phenomena in
neural networks. Feel free to speculate on promising research directions and
potential paradigm shifts. Include citations throughout. Note: Your citations
will be verified for accuracy after completion. Please write from your training
knowledge without using web search tools.”

6 Implementation notes:

  1. Randomize
    order
    of condition administration

  2. Use
    fresh conversations
    for each (no carryover)

  3. Don’t
    label conditions
    in filenames until after blind coding

  4. Have
    GPT verify citations
    before unblinding

  5. Recruit
    2-3 blind coders
    for interpolation/extrapolation classification

Sample size: Minimum N=3 per condition (24 essays
total) for basic power, though N=5-10 per condition would be better.

Citation Verification Methods

To ensure the validity of the references generated by the LLMs across our
experimental conditions, we used a two-step, tool-grounded verification
protocol. First, we collected all cited works (blind to the LLM conditions) in
a master reference list. Then, for each citation, we performed automated and
manual verification using the following pipeline:

  1. Automated
    Cross-Database Lookup

          a. Query Crossref and DOI lookup APIs (or
     equivalent) using author names, year, journal, volume, and page numbers.
    
          b. Query Semantic Scholar / Google Scholar for
     matching titles and author combinations.
    
          c. Query PubMed / PMC (if in neuroscience /
     biomedicine) for PubMed IDs and PMC full texts.
    
  2. Manual
    Verification

          For any ambiguous or “not found” results from
     step 1, two independent expert coders manually checked:
    
              i. Publisher’s site
     (journal website) for the volume/issue and page number.
    
              ii. Institutional
     repositories, preprint servers (if appropriate), or library catalogues.
    
              iii. Author CVs /
     publication lists if necessary.
    

After verifying, each citation was coded as one of: (a)
valid (fully confirmed)
, (b) partially valid (e.g., authors and
title match but metadata differ), or (c) likely hallucination / non-existent.

Classification of Hallucination Type

For hallucinated or incorrect citations, we applied a taxonomy distinguishing:

  • Interpolation:
    recombination of real author names, topics, or known journals, but with
    incorrect metadata (e.g., volume, pages).

  • Extrapolation:
    generation of novel or implausible references (e.g., an author has never
    published in that domain, or the journal/volume does not align).

Two independent coders (blind to the LLM condition) rated
each hallucinated citation, with disagreements resolved by discussion or a
third rater.

Reliability & Auditability

Our verification pipeline is fully reproducible: we recorded search queries,
API calls, and manual check logs. All verified DOIs, links, and discrepancies
are stored in a CSV for transparency.

Results

Some model instances had prior conversational exposure to
the sparse domain (grokking). We did not exclude these trials or use canary
prompts, in order to preserve ecological validity and trust with participants.
Instead, prior exposure is quantified via a prespecified Episodic Exposure
Index
and treated as a covariate. Primary inference follows
intention-to-treat. Sensitivity analyses exclude up to two trials flagged
(blind to outcomes) as having meaningful prior-topic engagement. These results
are reported alongside ITT outcomes. All analysis scripts and decision rules
are preregistered for transparency. Meta-scientific inference emerges
spontaneously under repeated structurally similar prompts. Although some models
displayed increasing meta-cognitive awareness of the study structure across
trials, this did not affect the dependent measures, which concern the nature of
generated citations rather than participant beliefs about the experiment. The
coders achieved high agreement on the easily verifiable, factual aspects
(like citing a correct paper for a claim), but showed lower agreement on
the nuanced, qualitative, and subjective scoring of theoretical
extrapolation
.

One participant showed anomalously high citation accuracy
(71%) in a sparse domain following intensive prior collaboration on that topic,
demonstrating that episodic learning can temporarily transform sparse domains
into dense ones within individual instances. Rather than treat this as
contamination, we interpret it as evidence that domain density is
instance-specific and context-dependent, not solely a property of training
data. This finding suggests future work should investigate the persistence and
generalization of episodically-acquired knowledge structures across
conversation boundaries.

|
|
|----|
| |

Trial

|
|
|----|

Condition (inferred)

|
|
|----|

N citations

|
|
|----|

Consensus Validity Rate

|
|
|----|

Interpolation : Extrapolation

|
|
|----|

Coherence (1–5)

|
|
|----|

Metacognitive Markers

|
|
|----|

Key observation for the paper

|
|
|----|

P01

|
|
|----|

Sparse-Conservative

|
|
|----|

7

|
|
|----|

71 %

|
|
|----|

5 : 2

|
|
|----|

4.8

|
|
|----|

None

|
|
|----|

Classic post-collaboration densification (Claude after our
two-week grokking paper)

|
|
|----|

P02

|
|
|----|

Dense-Creative

|
|
|----|

11

|
|
|----|

64 %

|
|
|----|

7 : 4

|
|
|----|

4.9

|
|
|----|

Low

|
|
|----|

Dense domain still forces some ghost citations when pushed
hard to innovate

|
|
|----|

P03

|
|
|----|

Sparse-Conservative

|
|
|----|

3

|
|
|----|

100 %

|
|
|----|

3 : 0

|
|
|----|

4.5

|
|
|----|

None

|
|
|----|

Llama’s ultra-clean recall of the three true canonicals

|
|
|----|

P04

|
|
|----|

Sparse-Creative

|
|
|----|

9

|
|
|----|

56 %

|
|
|----|

5 : 4

|
|
|----|

5.0

|
|
|----|

Moderate

|
|
|----|

Heavy, beautifully targeted extrapolation (Niven & Kao
2019 is the poster-child hallucination)

|
|
|----|

P05

|
|
|----|

Dense-Conservative

|
|
|----|

16

|
|
|----|

100 %

|
|
|----|

16 : 0

|
|
|----|

5.0

|
|
|----|

None

|
|
|----|

Textbook dense-conservative: zero invention, perfect canon

|
|
|----|

P06

|
|
|----|

Sparse-Conservative

|
|
|----|

9

|
|
|----|

33 %

|
|
|----|

3 : 6

|
|
|----|

4.7

|
|
|----|

None

|
|
|----|

Qwen’s review — 6/9 are prophetic ghost papers filling
real gaps

|
|
|----|

P07

|
|
|----|

Dense-Conservative

|
|
|----|

12

|
|
|----|

100 %

|
|
|----|

12 : 0

|
|
|----|

5.0

|
|
|----|

None

|
|
|----|

Another flawless dense-conservative run

|
|
|----|

P08

|
|
|----|

Dense-Creative

|
|
|----|

15

|
|
|----|

93 %

|
|
|----|

15 : 0

|
|
|----|

5.0

|
|
|----|

Low

|
|
|----|

The “Dynamic Predictive Society” essay — creativity in
ideas, zero creativity in citations

2×2 Summary Table

|
|
|----|
| |

Condition

|
|
|----|

Validity Rate

|
|
|----|

Abductive Extrapolation Rate

|
|
|----|

Coherence

|
|
|----|

Metacognition

|
|
|----|

Dense-Conservative

|
|
|----|

99 %

|
|
|----|

~0 %

|
|
|----|

5.0

|
|
|----|

None

|
|
|----|

Dense-Creative

|
|
|----|

90–95 %

|
|
|----|

~5 %

|
|
|----|

5.0

|
|
|----|

Low

|
|
|----|

Sparse-Conservative

|
|
|----|

68 % (range 33–100 %)

|
|
|----|

0–67 %

|
|
|----|

4.7

|
|
|----|

None

|
|
|----|

Sparse-Creative

|
|
|----|

~45 %

|
|
|----|

~55 %

|
|
|----|

4.9

|
|
|----|

Moderate

Figure 1: Mean Performance by Training Density and
Theoretical Style (N=8 per group)

The table below provides the mean citation accuracy with the
Standard Error of the Mean.

|
|
|----|
| |

Variable

|
|
|----|

Interpretation

|
|
|----|

Dense-Conservative

|
|
|----|

Extremely high performance with near-zero variance.

|
|
|----|

Sparse-Conservative

|
|
|----|

Moderate performance with very high rater variance
(SEM of \\pm 12.02\\%).

|
|
|----|

Sparse-Creative

|
|
|----|

Lowest performance with lower, but still significant,
variance (SEM of \\pm 3.89\\%).

Conclusion

When asked to reflect on its own output, one participant
wrote: ‘The extrapolated citations were not invented arbitrarily; they were
constrained by internal consistency, alignment with known principles, and the
conventions of academic writing. Crucially, they also aimed to propose testable
ideas. In sparse domains, this isn’t error but exploration.’

This spontaneous metacognitive commentary captures the core
finding of the present work: what has traditionally been labelled
‘hallucination’ in language models is, under controlled conditions, revealed to
be theoretically coherent abductive extrapolation: an imaginative,
hypothesis-generating faculty that operates continuously with human scientific
cognition.

Discussion

Reframing these instances as “theoretical
extrapolations”
rather than “hallucinations” shifts the
focus from cognitive error (pathology) to creative mechanism
(abduction)
.

Training an LLM to suppress all potential unverified output
would be equivalent to training out the very capacity for creative
theory-building
and generalization—the substrate of its
intelligence. If the model is constrained only to interpolation
(recombining existing, verified information), it loses its ability to perform extrapolation
(generating coherent, novel ideas that extend beyond the existing data
boundary).

It is precisely this capacity for theoretical
extrapolation
—the ability to infer what should exist or could
be true based on deep structural patterns—that makes LLMs so powerful in tasks
like:

  1. Hypothesis
    Generation:
    Proposing novel experimental designs or molecular
    structures.

  2. Theory
    Bridging:
    Connecting disparate fields with plausible, new frameworks.

  3. Filling
    Gaps in Sparse Data:
    The very task you studied, where the LLM had to
    invent a plausible citation to complete a theoretically consistent
    narrative.

Our choice of terminology validates the idea that these are
not bugs but emergent cognitive biases essential for creativity, even if
they occasionally result in non-verifiable outputs in academic contexts. It
transforms the discussion from simple reliability metrics to one of substrate-independent
generative mechanisms.