Hi everyone,
I am sharing an independent research project focused on how the composition of training data (specifically the math-to-non-math ratio) influences LLM stability, accuracy, and hallucination patterns.
The core of this work is a reproducible pipeline that trains identical models on synthetic corpora with controlled math ratios (r ∈ {0, 0.10, 0.25, 0.50, 0.75, 1.0}). Evaluation uses multi-seed inference, bootstrap resampling, and a CfC (Chain-of-Thought consistency) hallucination detector.
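For concreteness, the corpus-mixing step can be sketched as follows. This is a toy illustration, not the repository code: the pools, sizes, and the `build_corpus` helper are all placeholders.

```python
import random

def build_corpus(math_pool, general_pool, r, size, seed=0):
    """Sample a synthetic training corpus with a controlled math ratio r.

    Hypothetical helper: exactly round(r * size) examples are drawn
    (with replacement) from the math pool, the rest from the general pool,
    then the corpus is shuffled so the two domains are interleaved.
    """
    rng = random.Random(seed)
    n_math = round(r * size)
    corpus = rng.choices(math_pool, k=n_math) \
           + rng.choices(general_pool, k=size - n_math)
    rng.shuffle(corpus)
    return corpus

# Example: a 25% math corpus of 1000 examples
corpus = build_corpus(["math"] * 10, ["text"] * 10, r=0.25, size=1000)
print(corpus.count("math"))  # 250
```

Fixing the ratio exactly (rather than sampling each example's domain independently with probability r) removes one source of between-run variance before the multi-seed analysis.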
What I’m Sharing:
- Repository: Full code for dataset generation, training/eval scripts, and logs.
- Artifacts: the Damione2/math-ratio-model-performance GitHub repository: a fully reproducible multi-seed ablation study quantifying how the ratio of math examples in training data affects model performance, including code, experiments, statistical analysis (bootstrap, permutation tests, WLS, MixedLM), and the accompanying arXiv paper.
- Key Visuals: Scatter plots (best_f1 vs math_ratio with 95% CI), bootstrap histograms of slope coefficients, and mixed-effects residuals.
Key Findings:
- Higher math ratios improve math-task accuracy but can increase domain-shifted hallucinations in general tasks.
- Mixed ratios (25–50%) demonstrate the highest stability across different random seeds.
- Bootstrap CI on the slope supports a robust negative association between math_ratio and general best_f1 under the tested sampling scheme.
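To illustrate the slope-CI procedure behind the last finding, here is a minimal percentile-bootstrap sketch on the OLS slope. The data are synthetic stand-ins with an assumed negative trend, not the actual results; seed counts and noise levels are placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for the real runs: (math_ratio, general best_f1) pairs,
# 5 seeds per ratio, with an assumed negative trend plus seed noise.
ratios = np.repeat([0.0, 0.10, 0.25, 0.50, 0.75, 1.0], 5)
f1 = 0.80 - 0.15 * ratios + rng.normal(0, 0.02, ratios.size)

def slope(x, y):
    # OLS slope: leading coefficient of a degree-1 least-squares fit
    return np.polyfit(x, y, 1)[0]

# Nonparametric bootstrap: resample (ratio, f1) pairs with replacement
boot = np.empty(2000)
for b in range(boot.size):
    idx = rng.integers(0, ratios.size, ratios.size)
    boot[b] = slope(ratios[idx], f1[idx])

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"slope 95% CI: [{lo:.3f}, {hi:.3f}]")  # well below zero for this toy data
```

The real pipeline additionally weights by per-run variance (WLS) and cross-checks with permutation tests, but the percentile interval above is the core of the uncertainty claim.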
Technical Feedback Requested:
- Is the multi-seed + bootstrap protocol sufficient for quantifying uncertainty here, or should I implement stratified bootstrap by model family?
- Suggestions for improving the CfC detector’s labeling (automated heuristics vs. human adjudication)?
- Are there specific mixed-model robustness checks (e.g., non-linear terms or leave-one-model-out) you would recommend?
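To make the first question concrete, this is the stratified variant I have in mind: resampling within each model family so every bootstrap replicate preserves the original family composition. The family labels and data below are toy assumptions, not results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 10 runs from each of two hypothetical model families,
# sharing a negative trend but with a small per-family offset.
families = np.array(["A"] * 10 + ["B"] * 10)
x = np.tile(np.linspace(0.0, 1.0, 10), 2)
y = 0.80 - 0.15 * x + np.where(families == "A", 0.02, -0.02) \
    + rng.normal(0, 0.02, x.size)

def stratified_boot_slopes(x, y, groups, n_boot=1000):
    """Bootstrap the OLS slope, resampling with replacement *within*
    each group so the group sizes are fixed across replicates."""
    strata = [np.flatnonzero(groups == g) for g in np.unique(groups)]
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = np.concatenate(
            [rng.choice(s, size=s.size, replace=True) for s in strata])
        slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]
    return slopes

slopes = stratified_boot_slopes(x, y, families)
lo, hi = np.percentile(slopes, [2.5, 97.5])
```

Compared with the pooled bootstrap, this keeps the between-family imbalance out of the resampling noise; whether that, or resampling whole families (cluster bootstrap), is the right level here is exactly what I'd like feedback on.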
This is a technical, fully reproducible study; I welcome constructive critique and pointers to comparable evaluation scripts.
Best,
Damyan Damyanov