CfC‑based hallucination detector and dataset‑composition experiments (math‑ratio → stability, accuracy, hallucinations) — reproducible code + plots

Hi everyone,

I am sharing an independent research project focused on how the composition of training data (specifically the math-to-non-math ratio) influences LLM stability, accuracy, and hallucination patterns.

The core of this work is a reproducible pipeline that trains identical models on synthetic corpora with controlled math ratios (r ∈ {0, 0.10, 0.25, 0.50, 0.75, 1.0}). Evaluation uses multi-seed inference, bootstrap resampling, and a hallucination detector built on a closed-form continuous-time (CfC) network.
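As a concrete illustration of the multi-seed + bootstrap protocol, here is a minimal, self-contained sketch. `evaluate` is a synthetic stand-in for the real train/infer/score run, and the five-seed grid is illustrative, not the actual configuration:

```python
import numpy as np

RATIOS = [0.0, 0.10, 0.25, 0.50, 0.75, 1.0]  # math-to-non-math mixing ratios
SEEDS = range(5)                             # hypothetical seed set

def evaluate(ratio: float, seed: int) -> float:
    """Placeholder for one real run (train + inference + scoring).
    Returns a synthetic accuracy so the sketch is runnable on its own."""
    rng = np.random.default_rng(seed)
    return 0.60 + 0.20 * ratio + rng.normal(0.0, 0.03)

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over the per-seed scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample the per-seed scores with replacement, n_boot times.
    resamples = rng.choice(scores, size=(n_boot, scores.size), replace=True)
    means = resamples.mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lo, hi

for r in RATIOS:
    per_seed = [evaluate(r, s) for s in SEEDS]
    mean, lo, hi = bootstrap_ci(per_seed)
    print(f"ratio={r:.2f}  mean={mean:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```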

What I’m Sharing:

  • Reproducible code for the full training and evaluation pipeline
  • Plots of the stability, accuracy, and hallucination results across ratios
  • The CfC-based hallucination detector used in evaluation

Key Findings:

  1. Higher math ratios improve math-task accuracy but can increase domain-shifted hallucinations in general tasks.
  2. Mixed ratios (25–50%) demonstrate the highest stability across different random seeds.
  3. A bootstrap CI on the slope supports a robust negative association between math_ratio and general best_f1 under the tested sampling scheme (the computation is sketched after this list).
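Here is a minimal sketch of the slope-CI computation behind finding 3, resampling (math_ratio, best_f1) pairs and refitting an OLS line each time. The data at the bottom are illustrative placeholders, not my actual results:

```python
import numpy as np

def slope_bootstrap_ci(x, y, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the OLS slope of y on x,
    resampling (ratio, f1) pairs with replacement."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    slopes = []
    while len(slopes) < n_boot:
        idx = rng.integers(0, x.size, size=x.size)
        if x[idx].std() == 0:  # degenerate draw: one ratio only, no slope
            continue
        slopes.append(np.polyfit(x[idx], y[idx], 1)[0])  # degree-1 fit -> slope
    lo, hi = np.quantile(slopes, [alpha / 2, 1 - alpha / 2])
    return np.polyfit(x, y, 1)[0], lo, hi

# Illustrative placeholder data: one best_f1 value per (ratio, seed) run.
ratios  = np.repeat([0.0, 0.10, 0.25, 0.50, 0.75, 1.0], 5)
best_f1 = 0.70 - 0.15 * ratios + np.random.default_rng(1).normal(0, 0.02, ratios.size)

slope, lo, hi = slope_bootstrap_ci(ratios, best_f1)
print(f"slope={slope:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```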

Technical Feedback Requested:

  • Is the multi-seed + bootstrap protocol sufficient for quantifying uncertainty here, or should I implement a stratified bootstrap by model family (one version is sketched after this list)?
  • Suggestions for improving the CfC detector’s labeling (automated heuristics vs. human adjudication)?
  • Are there specific mixed-model robustness checks (e.g., non-linear terms or leave-one-model-out, also sketched below) that you would recommend?
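For the stratified-bootstrap option in the first question, this is roughly what I have in mind, assuming a per-run model-family label is available (the function name and labels are mine, not from the pipeline):

```python
import numpy as np

def stratified_bootstrap_ci(values, families, n_boot=10_000, alpha=0.05, seed=0):
    """Resample within each model family and then pool, so every replicate
    keeps the per-family sample sizes fixed and no family can dominate a
    replicate by chance."""
    rng = np.random.default_rng(seed)
    values, families = np.asarray(values, float), np.asarray(families)
    groups = [values[families == f] for f in np.unique(families)]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        pooled = np.concatenate(
            [rng.choice(g, size=g.size, replace=True) for g in groups]
        )
        stats[b] = pooled.mean()
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```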
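And a corresponding leave-one-model-out check for the third question, refitting the ratio-to-f1 slope with each family held out (again a sketch with hypothetical names):

```python
import numpy as np

def leave_one_family_out(x, y, families):
    """Refit the math_ratio -> best_f1 slope with each model family held out.
    A slope that flips sign or shrinks sharply flags a family that is
    driving the pooled association."""
    x, y, families = np.asarray(x, float), np.asarray(y, float), np.asarray(families)
    slopes = {}
    for f in np.unique(families):
        keep = families != f
        slopes[str(f)] = np.polyfit(x[keep], y[keep], 1)[0]
    return slopes
```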

This is a technical, reproducible research post. I welcome constructive critique and pointers to comparable evaluation scripts.

Best,
Damyan Damyanov
