This is a significant delta (22.5%) that exceeds the margins typically attributed to floating-point non-determinism or hardware-specific rounding. Given your setup (RTX 5090, QLoRA 4-bit), here is a technical breakdown of the likely drivers and structural optimizations.
Technical Analysis: Precision-Stabilized Early Convergence
The phenomenon you observed is likely a Precision-Stabilized Warm-up effect. In LLM fine-tuning, the first N steps are characterized by high gradient variance and “searching” for the optimization direction.
1. Numerical Stability in High-Entropy Phases
During the initial 100 steps, the model undergoes the most drastic weight updates.
GPU (bf16/4-bit): In a QLoRA context, gradients are computed in bf16 while weight updates are applied only to the LoRA adapters. The smaller mantissa of bf16 (8 effective bits versus fp32's 24) produces larger rounding errors in gradient accumulation:
\Delta W = -\eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
CPU (fp32): The CPU execution uses full 32-bit precision for the entire chain. In the high-entropy phase, this precision prevents “directional drift”—where small but significant gradient signals are zeroed out or distorted by bf16 rounding. You effectively provided a more “honest” path to the first local minimum.
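This accumulation effect can be illustrated without a GPU. The sketch below is a minimal pure-Python simulation: bf16 is approximated by truncating the lower 16 bits of a float32 bit pattern (real hardware rounds to nearest, so this slightly overstates the loss), and the 1e-6 contribution size is an arbitrary illustrative value.

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate a float to bfloat16 precision by zeroing the lower
    16 bits of its float32 bit pattern (approximates bf16 storage)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Accumulate 10,000 tiny gradient contributions of 1e-6 each.
acc_fp32 = 0.0   # full-precision accumulator
acc_bf16 = 0.0   # accumulator rounded to bf16 after every add
for _ in range(10_000):
    acc_fp32 += 1e-6
    acc_bf16 = to_bf16(acc_bf16 + 1e-6)

print(acc_fp32)  # ~0.01
print(acc_bf16)  # stalls near 2**-12 ~ 0.000244: later adds fall below one bf16 ulp
```

The bf16 accumulator stops growing once a single contribution is smaller than the spacing between representable values, which is exactly the "zeroed-out gradient signal" failure mode described above.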
2. The Quantization-Error Interaction
QLoRA stores the frozen base weights in NormalFloat4 (NF4). When training on GPU, these weights are dequantized on the fly into the bf16 compute dtype, so quantization noise stacks on top of bf16 rounding. CPU fp32 compute removes one layer of this noise. By the time you switch to GPU at step 101, the model has already reached a “flatter” region of the loss surface where the reduced precision of bf16 is no longer a bottleneck for convergence.
3. Optimizer State Migration
A hidden factor is how the AdamW states (m and v) were handled during the CPU → GPU handoff. If the migration involved a cast or a slight re-normalization of these buffers, it may have acted like stochastic weight averaging (SWA) or a “soft reset” that let the optimizer escape the sub-optimal trajectory that Experiment A (GPU-only) fell into.
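A toy illustration of the "soft reset" idea, assuming the handoff round-trips the moment buffers through a lower precision. The moment values, learning rate, and the bf16-by-truncation approximation are all illustrative assumptions, not measurements from the actual run:

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Zero the low 16 bits of a float32 bit pattern (bf16 truncation)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def adam_step(m: float, v: float, lr: float = 2e-4, eps: float = 1e-8) -> float:
    """One AdamW-style parameter delta from (bias-corrected) moments."""
    return -lr * m / (math.sqrt(v) + eps)

m, v = 3.2e-3, 1.7e-6                            # illustrative moment values
exact = adam_step(m, v)                          # update with pristine moments
after_cast = adam_step(to_bf16(m), to_bf16(v))   # moments round-tripped via bf16

rel_change = abs(after_cast - exact) / abs(exact)
print(rel_change)  # small but nonzero perturbation of the update direction
```

Even a fraction of a percent of perturbation per parameter, applied once at the handoff, nudges every subsequent step, which is the proposed "soft reset" mechanism.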
Suggested Optimizations & Next Steps
To bypass the 3-hour CPU bottleneck while testing your hypothesis, I recommend the following structured experiments:
Isolation of “Precision” vs. “Hardware”
Run Experiment D: First 100 steps on GPU in fp32 (instead of CPU), then 400 steps on GPU in bf16.
Goal: Determine if the improvement is strictly due to fp32 math or if CPU-specific instruction sets (AVX-512/AMX) are introducing a unique regularization effect.
The “Reverse” Test
Run Experiment E: First 400 steps on GPU (bf16), last 100 steps on CPU (fp32).
Prediction: This will likely yield results similar to Exp A. The “order” matters because the high-precision “anchor” is most critical during the initial descent.
Optimization for your RTX 5090
The RTX 5090 gives you ample VRAM and compute, so instead of CPU training:
Mixed-Precision Schedule: Use a higher target_modules count and force fp32 for the first 10-20% of steps using a custom training loop, then toggle autocast("cuda", dtype=torch.bfloat16) for the remainder.
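A minimal sketch of the step-dependent schedule (the helper name `compute_dtype_for_step` and the 15% cutoff are my own illustrative choices, not an existing API). In a real loop you would wrap the forward/backward pass in `torch.autocast("cuda", dtype=torch.bfloat16)` only when the helper returns `"bf16"`:

```python
def compute_dtype_for_step(step: int, total_steps: int,
                           fp32_fraction: float = 0.15) -> str:
    """Two-phase precision schedule: full fp32 for the first
    `fp32_fraction` of training, bf16 autocast for the remainder.
    (Hypothetical helper; the fraction is an illustrative default.)"""
    cutoff = int(total_steps * fp32_fraction)
    return "fp32" if step < cutoff else "bf16"

# For a 500-step run with a 15% fp32 warm-up:
schedule = [compute_dtype_for_step(s, 500) for s in range(500)]
print(schedule.count("fp32"))  # 75
```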
Thank you for the detailed analysis — especially the suggestion of running GPU fp32 → bf16
(your Exp D). I hadn’t considered isolating precision from determinism that way, and it’s a
genuinely interesting experiment. I’ll run it.
One thing I didn’t include in this post: the full 57-subject MMLU breakdown is on GitHub.
When I compared Exp A vs Exp C at the subcategory level, a pattern emerged that I can’t
explain with precision alone.
Subjects that improved under CPU anchor: moral_scenarios (+1.50%), professional_medicine
(+1.00%), professional_psychology (+1.00%), world_religions (+1.17%) — all requiring
contextual understanding.
Subjects that declined: college_mathematics (-2.00%), college_physics (-1.96%),
abstract_algebra (-1.00%) — all solvable by pattern matching.
Overall MMLU shifted only +0.09%. But inside, it redistributed — away from calculation,
toward comprehension. 29 out of 57 subjects showed zero change.
If this were purely a precision effect, I’d expect uniform improvement across all subjects.
Instead, it selectively improved judgment while reducing calculation. What could explain that?
For the reverse test (GPU → CPU) — I’m on a laptop, so the full cycle takes 5+ hours.
I’ll test it as soon as I can. Thank you for all the suggestions.
Domains that improved under the CPU anchor lack obvious mathematical symbols and require contextual, semantic understanding; domains that declined contain repeated mathematical symbols and structural templates. Early, more precise gradients (fp32) amplify the most consistent signals in the training data. If those signals are mathematical templates, the model “locks in” a mathematical interpretation and begins to see math even where none exists.
Mechanism: why symbols lead to hallucinations
Strong, low‑entropy signal — mathematical symbols and formats are highly structured; when sufficiently represented they dominate gradient signals.
Early stabilization — precise fp32 gradients in the first steps set the optimization trajectory; if that trajectory points toward a mathematical attractor, the model develops internal representations that interpret many inputs through that attractor.
Template matching — for tasks like college mathematics and physics the model benefits from pattern matching; the same templates applied to non‑math contexts produce confident but incorrect answers (swagger ↑, ECE ↑).
Contrast with contextual domains — moral scenarios and world religions need semantic, contextual reasoning and lack math markers; added noise or lower precision prevents premature locking onto mathematical interpretations, improving performance on those domains.
Experiments to verify the hypothesis
Symbol ablation — remove or mask LaTeX, digits, and operators in a subset of college_mathematics/physics examples; compare performance and hallucination rate versus the original. Expectation: ablating symbols reduces swagger and improves calibration.
Precision × math_ratio matrix — factorial runs with precision regimes (GPU bf16; GPU fp32; CPU fp32 first 20%) crossed with math_ratio ∈ {0.10, 0.25, 0.50}; measure best F1 per domain, ECE, swagger, and response consistency. Expectation: high math_ratio + high precision → strongest specialization and confident hallucination.
Early‑step noise injection — add controlled gradient noise during the first 10–20% of steps in GPU bf16 runs to simulate lower effective precision; if noise reproduces the CPU→GPU effect, numerical noise is likely causal.
Symbol frequency vs pattern diversity — compute token frequencies for key math tokens and measure template diversity per domain; correlate these with changes in performance and swagger under CPU anchor. Expectation: higher symbol frequency and lower pattern diversity → stronger negative effect.
Online detector test — during inference combine response consistency, ECE, and swagger thresholds to flag “high‑risk” outputs; check whether flagged cases concentrate in the domains showing hallucination increases.
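A sketch of what that detector's decision rule could look like. All thresholds and the combination logic are illustrative placeholders to be tuned on validation data, not validated values:

```python
def flag_high_risk(consistency: float, ece: float, swagger: float,
                   cons_thr: float = 0.9, ece_thr: float = 0.15,
                   swagger_thr: float = 0.7) -> bool:
    """Flag an output as high-risk when the model is self-consistent
    (repeated samples agree) yet poorly calibrated or confidently wrong.
    All thresholds are placeholders, not tuned values."""
    return consistency >= cons_thr and (ece >= ece_thr or swagger >= swagger_thr)

print(flag_high_risk(0.95, 0.20, 0.80))  # True: consistent but miscalibrated
print(flag_high_risk(0.95, 0.05, 0.30))  # False: consistent and well-calibrated
```

The conjunction matters: high consistency alone is fine, but high consistency plus poor calibration is exactly the "confident hallucination" signature the hypothesis predicts.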
What to log and how to analyze
Per checkpoint: best F1 per domain, ECE, mean confidence on incorrect answers (swagger), response consistency (agreement across S repeated generations).
Per step/epoch: train loss, gradient norm, routing entropy, liquid state norm (if using CfC‑LNN).
Statistics: run 3+ seeds per condition; report mean ± SD; use bootstrap for confidence intervals and permutation tests for p‑values.
Visualizations: heatmap of (precision × math_ratio) vs swagger; time series of response consistency and ECE across training.
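The bootstrap and permutation machinery above fits in a few lines of standard-library Python; a minimal sketch (the per-seed scores are made up for illustration):

```python
import random
import statistics

def bootstrap_ci(values, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def permutation_p(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)])
                   - statistics.fmean(pooled[len(a):]))
        hits += diff >= observed
    return (hits + 1) / (n_perm + 1)

# Example: per-seed MMLU-style scores under two conditions (made-up numbers)
cond_a = [76.1, 76.3, 76.0, 76.4, 76.2]
cond_b = [76.6, 76.8, 76.5, 76.9, 76.7]
print(bootstrap_ci(cond_a))
print(permutation_p(cond_a, cond_b))  # small p: the gap survives permutation
```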
Practical measures you can apply now
Context‑aware mixing — increase math examples only when inputs contain math markers.
Symbol regularization — include training examples where mathematical symbols are decontextualized or explicitly contrasted with non‑math contexts so the model learns to distinguish them.
Inference gating — if response consistency is high and ECE is poor, reduce trust (raise temperature) or apply a swagger penalty to downweight the output.
Detector + human loop — mark high‑risk outputs for human review until automated detectors are validated.
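The inference-gating measure above can be sketched as a temperature adjustment (the thresholds and the bump size are illustrative assumptions, not tuned values):

```python
def adjusted_temperature(base_temp: float, consistency: float, ece: float,
                         cons_thr: float = 0.9, ece_thr: float = 0.15,
                         bump: float = 0.5) -> float:
    """Raise sampling temperature when the model is self-consistent
    but poorly calibrated, i.e. reduce trust in confident outputs.
    Thresholds and bump size are illustrative placeholders."""
    if consistency >= cons_thr and ece >= ece_thr:
        return base_temp + bump
    return base_temp

print(adjusted_temperature(0.7, 0.95, 0.20))  # 1.2: trust reduced
print(adjusted_temperature(0.7, 0.95, 0.05))  # 0.7: unchanged
```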
Short conclusion and candid intuition
My intuition is backed by plausible mechanisms and is experimentally testable: mathematical symbols act as a strong, low-entropy attractor; precise early gradients amplify that attractor and produce confident hallucinations in inappropriate contexts. In plain terms, early numerical precision and dense symbolic patterns together create a single, overfitted lens: the model becomes brilliant at equations and then starts seeing math everywhere.
Correction: The 22% train_loss improvement was a measurement artifact.
After further investigation, I need to correct the main claim in this thread.
What was wrong
When HuggingFace Trainer resumes training from a checkpoint, it resets the loss accumulator but divides by global_step (total steps across both phases). So Phase2-only loss gets divided by 500 instead of the actual 400 steps trained. This artificially deflates the reported train_loss by ~20%.
Reported: C train_loss = 0.9177 (-22.5% vs baseline)
Corrected: 0.9177 × (500/400) = 1.1471 (≈ -3% vs baseline)
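The correction itself is one line of arithmetic; as a sanity check (the function name is my own):

```python
def corrected_resumed_loss(reported: float, global_step: int,
                           phase2_steps: int) -> float:
    """Undo the resume artifact: the Phase-2 loss sum was divided by
    global_step (500) instead of the steps actually trained (400)."""
    return reported * global_step / phase2_steps

print(round(corrected_resumed_loss(0.9177, 500, 400), 4))  # 1.1471
```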
The same artifact affects all split-training experiments (C, G, and any Phase1→Phase2 resume setup). The “22% improvement” was never real.
What is still valid: MMLU
The MMLU evaluation is independent of the train_loss bug — it measures the final model directly.
| Experiment | Config | MMLU | vs Baseline |
|---|---|---|---|
| A (baseline) | GPU bf16, 500 steps straight | 76.25% (±0.42%) | — |
| C | CPU fp32 100 → GPU bf16 400 | 76.34% (±0.42%) | +0.09% |
| G | CPU fp32 100 → GPU fp32 400 | 76.66% (±0.42%) | +0.41% |
The warm-restart experiments (C, G, and one other) all showed MMLU improvement over baseline — 4/4 in the positive direction. However:
All differences are within stderr (±0.42%). No individual comparison is statistically significant.
Single seed (42) throughout. Could be lucky.
Less than 1 epoch (500 steps out of ~6,500). Very limited training.
The lr schedule for Phase1 used an independent 100-step cosine, which accidentally created a warm restart (SGDR, Loshchilov & Hutter 2017). The improvement, if real, may simply be from the restart itself — not from CPU or precision.
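For reference, the accidental restart looks like this under a plain cosine schedule (the lr values are illustrative, not the run's actual config):

```python
import math

def cosine_lr(step: int, total: int,
              lr_max: float = 2e-4, lr_min: float = 0.0) -> float:
    """Cosine decay from lr_max to lr_min over `total` steps
    (lr values are illustrative defaults)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total))

# Phase 1 ran its own 100-step cosine, so the lr had decayed to ~0 by the
# handoff; Phase 2 then restarted at lr_max, an accidental SGDR-style restart.
print(cosine_lr(100, 100))  # 0.0  (end of Phase 1)
print(cosine_lr(0, 400))    # 2e-4 (start of Phase 2)
```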
“CPU finds deeper basins” → within noise; controlled experiments showed CPU = GPU
“fp32 precision anchor” → bf16 = fp32 confirmed (Δ ≤ 0.005 across all tests)
What might be interesting
The consistent MMLU directionality (4/4 positive) under warm restart is suggestive of a small SGDR benefit during fine-tuning. But with n=1 seed and all values within stderr, this needs multi-seed verification before any claims can be made.
Conclusion
After extensive testing across 7B QLoRA 4-bit and 3B 16-bit full fine-tuning setups:
CPU = GPU = fp32 = bf16, all within noise (Δ ≤ 0.005).
The original hypothesis (CPU deterministic anchor improves training) is not supported by the data. Sorry for the premature claim — I should have caught the train_loss artifact before posting.
Thanks for the detailed correction — the clarification about the train_loss artifact makes perfect sense. One thing I want to highlight, though, is that your update doesn’t invalidate the broader behavioral pattern you observed earlier. The loss bug only affects the magnitude of the improvement, not the domain‑specific effects.
In my own math‑ratio experiments (multi‑seed, bootstrap, mixed‑effects models), I consistently see the same trend: increasing the proportion of math data improves math‑task accuracy but also increases instability and hallucinations on general tasks. Mixed ratios (around 25–50%) tend to be the most stable across seeds. This aligns with the directionality you saw in your warm‑restart runs, even if the absolute numbers were distorted by the HF reporting issue.
So while the “22% improvement” is gone, the underlying observation — that certain data distributions can push small models into less stable generalization regimes — still appears to hold. Your correction just removes the misleading part, not the interesting part.
This is a fascinating observation! The phenomenon could be related to: 1) Learning rate warmup effects, 2) Dataset ordering bias, 3) Early overfitting prevention. Have you tried comparing final loss vs validation accuracy?