What if the first 20% of fine-tuning steps ran on CPU? Train loss dropped 22.5% — and I can't explain why

I ran a weird experiment and got results I can’t explain.

Setup: Qwen2.5-7B-Instruct, QLoRA 4-bit, alpaca_en, 500 steps, seed 42.

  • Experiment A: All 500 steps on GPU (bf16) — normal training.
  • Experiment C: First 100 steps on CPU (fp32), then 400 steps on GPU (bf16).

Same model, same data, same seed, same total steps, same total FLOPs (22,257,323 GFLOPs).

Results

             Exp A (GPU-only)   Exp C (CPU→GPU Hybrid)
Train loss   1.184              0.9177
MMLU 5-shot  76.25%             76.34%
Time         ~11 min            ~3 h + 8 min

Train loss: -22.5%. Benchmark: equivalent (slightly better even).

Is it noise? No.

I ran Exp A a second time, same seed 42:

Run          Train loss
A (1st run)  1.184
A (2nd run)  1.1841
C            0.9177

GPU variance: 0.008%. C’s difference: 22.5%. That’s ~2800x larger than GPU non-determinism.

Why this doesn’t match existing research

  • GPU fp32 vs bf16? Industry consensus: virtually no difference in final loss.
  • GPU deterministic vs non-deterministic? Literature shows ~0.03% difference at most.
  • CPU→GPU hybrid? 22.5% — roughly 1000x larger than either factor alone.

What’s available

Everything is public and reproducible:

What would help

I’m on a gaming laptop (RTX 5090 Laptop). The CPU phase took ~3 hours for just 100 steps. I can’t scale this up.

If anyone with server-grade hardware could try:

  1. Multiple seeds — is -22.5% consistent across seeds?
  2. Different CPU ratios (10%, 30%, 40%) — does the improvement scale linearly, or is 20% truly the sweet spot?
  3. Reverse experiment (GPU first → CPU last) — is the order essential?
  4. Larger models (13B, 70B) — does the effect amplify with scale?

Even just running Exp A and Exp C with 2-3 different seeds on the same model would be incredibly valuable.

I found something I can’t explain with existing literature. Would love to hear thoughts — or better yet, reproduction attempts.

Note: English isn’t my first language — this post was written with help from Claude. But the experiments, data, and confusion are 100% mine.

1 Like

This is a significant delta (22.5%) that exceeds the margins typically attributed to floating-point non-determinism or hardware-specific rounding. Given your setup (RTX 5090, QLoRA 4-bit), here is a technical breakdown of the likely drivers and structural optimizations.

Technical Analysis: Precision-Stabilized Early Convergence

The phenomenon you observed is likely a Precision-Stabilized Warm-up effect. In LLM fine-tuning, the first N steps are characterized by high gradient variance and “searching” for the optimization direction.

1. Numerical Stability in High-Entropy Phases

During the initial 100 steps, the model undergoes the most drastic weight updates.

  • GPU (bf16/4-bit): In a QLoRA context, gradients are computed in bf16 and weight updates are applied only to the LoRA adapters. The lower mantissa precision of bf16 compared to fp32 (8 vs 24 significand bits) leads to higher rounding error in gradient accumulation:

    \Delta W = -\eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
  • CPU (fp32): The CPU execution uses full 32-bit precision for the entire chain. In the high-entropy phase, this precision prevents “directional drift”—where small but significant gradient signals are zeroed out or distorted by bf16 rounding. You effectively provided a more “honest” path to the first local minimum.
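
The bf16-rounding argument above can be illustrated with a tiny self-contained sketch. It emulates bf16 in pure Python by truncating a float32 bit pattern; the 1e-4 update size is an illustrative value, not one measured from these runs:

```python
import struct

def to_bf16(x: float) -> float:
    """Round a Python float to bfloat16 precision (keep top 16 bits of float32)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000  # round-half-up on the dropped bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Accumulate 1000 small updates of 1e-4 into a weight that starts at 1.0.
w_fp32, w_bf16 = 1.0, 1.0
for _ in range(1000):
    w_fp32 += 1e-4
    w_bf16 = to_bf16(w_bf16 + to_bf16(1e-4))

print(w_fp32)  # ≈ 1.1: the updates accumulate
print(w_bf16)  # 1.0: every update rounds away (bf16 grid step near 1.0 is 2**-8)
```

Near 1.0 the bf16 grid spacing is 2^-8 ≈ 0.0039, so any update smaller than about half of that rounds away entirely, while fp32 (spacing ~1.2e-7) keeps it.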

2. The Quantization-Error Interaction

QLoRA uses NormalFloat4 (NF4). When training on GPU in bf16, the dequantization-requantization cycle interacts with the bf16 compute dtype. CPU fp32 compute removes one layer of this noise. By the time you switch to GPU at step 101, the model has already reached a “flatter” region of the loss surface where the reduced precision of bf16 is no longer a bottleneck for convergence.

3. Optimizer State Migration

A hidden factor is how the AdamW states (m and v) were handled during the CPU → GPU handoff. If the migration involved a cast or a slight re-normalization of these buffers, it may have acted as a form of stochastic weight averaging (SWA) or a “soft reset” that allowed the optimizer to escape a sub-optimal trajectory that Experiment A (GPU-only) fell into.

Suggested Optimizations & Next Steps

To bypass the 3-hour CPU bottleneck while testing your hypothesis, I recommend the following structured experiments:

Isolation of “Precision” vs. “Hardware”

  • Run Experiment D: First 100 steps on GPU in fp32 (instead of CPU), then 400 steps on GPU in bf16.

  • Goal: Determine if the improvement is strictly due to fp32 math or if CPU-specific instruction sets (AVX-512/AMX) are introducing a unique regularization effect.

The “Reverse” Test

  • Run Experiment E: First 400 steps on GPU (bf16), last 100 steps on CPU (fp32).

  • Prediction: This will likely yield results similar to Exp A. The “order” matters because the high-precision “anchor” is most critical during the initial descent.

Optimization for your RTX 5090

Since you have a 5090, you have massive VRAM and compute. Instead of CPU training:

  • Mixed-Precision Schedule: Use a higher target_modules count and force fp32 for the first 10-20% of steps using a custom training loop, then toggle autocast("cuda", dtype=torch.bfloat16) for the remainder.
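
A minimal sketch of that schedule, assuming a custom loop over 500 steps with the 20% fp32 fraction from this thread (the torch calls in the comments are the standard autocast API, shown only as orientation):

```python
def dtype_for_step(step: int, total_steps: int, fp32_fraction: float = 0.2) -> str:
    """Full precision for the first fp32_fraction of steps, bf16 afterwards."""
    return "float32" if step < int(total_steps * fp32_fraction) else "bfloat16"

# Inside a custom training loop this would gate autocast, roughly:
#   use_bf16 = dtype_for_step(step, 500) == "bfloat16"
#   with torch.autocast("cuda", dtype=torch.bfloat16, enabled=use_bf16):
#       loss = model(**batch).loss
# (autocast is simply disabled during the fp32 phase, since fp32 is the default)
```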
2 Likes

Thank you for the detailed analysis — especially the suggestion of running GPU fp32 → bf16
(your Exp D). I hadn’t considered isolating precision from determinism that way, and it’s a
genuinely interesting experiment. I’ll run it.

One thing I didn’t include in this post: the full 57-subject MMLU breakdown is on GitHub.
When I compared Exp A vs Exp C at the subcategory level, a pattern emerged that I can’t
explain with precision alone.

Subjects that improved under CPU anchor: moral_scenarios (+1.50%), professional_medicine
(+1.00%), professional_psychology (+1.00%), world_religions (+1.17%) — all requiring
contextual understanding.

Subjects that declined: college_mathematics (-2.00%), college_physics (-1.96%),
abstract_algebra (-1.00%) — all solvable by pattern matching.

Overall MMLU shifted only +0.09%. But inside, it redistributed — away from calculation,
toward comprehension. 29 out of 57 subjects showed zero change.

If this were purely a precision effect, I’d expect uniform improvement across all subjects.
Instead, it selectively improved judgment while reducing calculation. What could explain that?

For the reverse test (GPU → CPU) — I’m on a laptop, so the full cycle takes 5+ hours.
I’ll test it as soon as I can. Thank you for all the suggestions.

1 Like

Domains that improved under the CPU anchor lack obvious mathematical symbols and require contextual, semantic understanding; domains that declined contain repeated mathematical symbols and structural templates. Early, more precise (fp32) gradients amplify the most consistent signals in the training data. If those signals are mathematical templates, the model “locks in” a mathematical interpretation and begins to see math even where none exists.

Mechanism: why symbols lead to hallucinations

  • Strong, low‑entropy signal — mathematical symbols and formats are highly structured; when sufficiently represented they dominate gradient signals.

  • Early stabilization — precise fp32 gradients in the first steps set the optimization trajectory; if that trajectory points toward a mathematical attractor, the model develops internal representations that interpret many inputs through that attractor.

  • Template matching — for tasks like college mathematics and physics the model benefits from pattern matching; the same templates applied to non‑math contexts produce confident but incorrect answers (swagger ↑, ECE ↑).

  • Contrast with contextual domains — moral scenarios and world religions need semantic, contextual reasoning and lack math markers; added noise or lower precision prevents premature locking onto mathematical interpretations, improving performance on those domains.

Experiments to verify the hypothesis

  • Symbol ablation — remove or mask LaTeX, digits, and operators in a subset of college_mathematics/physics examples; compare performance and hallucination rate versus the original. Expectation: ablating symbols reduces swagger and improves calibration.

  • Precision × math_ratio matrix — factorial runs with precision regimes (GPU bf16; GPU fp32; CPU fp32 first 20%) crossed with math_ratio ∈ {0.10, 0.25, 0.50}; measure best F1 per domain, ECE, swagger, and response consistency. Expectation: high math_ratio + high precision → strongest specialization and confident hallucination.

  • Early‑step noise injection — add controlled gradient noise during the first 10–20% of steps in GPU bf16 runs to simulate lower effective precision; if noise reproduces the CPU→GPU effect, numerical noise is likely causal.

  • Symbol frequency vs pattern diversity — compute token frequencies for key math tokens and measure template diversity per domain; correlate these with changes in performance and swagger under CPU anchor. Expectation: higher symbol frequency and lower pattern diversity → stronger negative effect.

  • Online detector test — during inference combine response consistency, ECE, and swagger thresholds to flag “high‑risk” outputs; check whether flagged cases concentrate in the domains showing hallucination increases.
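
Of these, the early-step noise-injection experiment is the cheapest to prototype. A plain-Python sketch standing in for the per-parameter torch loop; sigma0 and the 20% cutoff are illustrative values:

```python
import random

def noise_sigma(step: int, total_steps: int,
                frac: float = 0.2, sigma0: float = 1e-3) -> float:
    """Gradient-noise std: nonzero only during the first `frac` of training."""
    return sigma0 if step < int(frac * total_steps) else 0.0

def inject_noise(grads, step, total_steps, rng):
    """Add zero-mean Gaussian noise to each gradient component (early steps only)."""
    s = noise_sigma(step, total_steps)
    return [g + rng.gauss(0.0, s) for g in grads]

rng = random.Random(0)
early = inject_noise([0.5, -0.2], step=10, total_steps=500, rng=rng)
late = inject_noise([0.5, -0.2], step=400, total_steps=500, rng=rng)
# `early` is perturbed; `late` comes back unchanged because sigma is 0 after step 100
```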

What to log and how to analyze

  • Per checkpoint: best F1 per domain, ECE, mean confidence on incorrect (swagger), response consistency (S repeats).

  • Per step/epoch: train loss, gradient norm, routing entropy, liquid state norm (if using CfC‑LNN).

  • Statistics: run 3+ seeds per condition; report mean ± SD; use bootstrap for confidence intervals and permutation tests for p‑values.

  • Visualizations: heatmap of (precision × math_ratio) vs swagger; time series of response consistency and ECE across training.
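
For the statistics bullet, a minimal percentile-bootstrap sketch (the score list is a placeholder, not real data):

```python
import random
import statistics

def bootstrap_ci(samples, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

scores = [76.25, 76.34, 76.66, 76.41, 76.19]  # e.g. per-seed MMLU scores
lo, hi = bootstrap_ci(scores)
```

The permutation test for p-values follows the same pattern: shuffle condition labels, recompute the mean difference, and count how often the shuffled difference exceeds the observed one.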

Practical measures you can apply now

  • Context‑aware mixing — increase math examples only when inputs contain math markers.

  • Symbol regularization — include training examples where mathematical symbols are decontextualized or explicitly contrasted with non‑math contexts so the model learns to distinguish them.

  • Inference gating — if response consistency is high and ECE is poor, reduce trust (raise temperature) or apply a swagger penalty to downweight the output.

  • Detector + human loop — mark high‑risk outputs for human review until automated detectors are validated.

Short conclusion and candid intuition

My intuition is supported by plausible mechanisms and is experimentally testable: mathematical symbols act as a strong, low-entropy attractor; precise early gradients amplify that attractor and produce confident hallucinations in inappropriate contexts. In plain terms, early numerical precision and dense symbolic patterns together create a single, unbounded lens: the model becomes brilliant at equations and then starts seeing math everywhere.

1 Like

:warning: Correction: The 22% train_loss improvement was a measurement artifact.

After further investigation, I need to correct the main claim in this thread.

What was wrong

When HuggingFace Trainer resumes training from a checkpoint, it resets the loss accumulator but divides by global_step (total steps across both phases). So Phase2-only loss gets divided by 500 instead of the actual 400 steps trained. This artificially deflates the reported train_loss by ~20%.

Reported:  C train_loss = 0.9177  (-22.5% vs baseline)
Corrected: 0.9177 × (500/400) = 1.1471  (~3% vs baseline)

The same artifact affects all split-training experiments (C, G, and any Phase1→Phase2 resume setup). The “22% improvement” was never real.
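
The arithmetic of the correction, for anyone auditing their own resumed runs (a sketch using the step counts from this thread):

```python
phase1_steps, phase2_steps = 100, 400
global_step = phase1_steps + phase2_steps  # what HF Trainer divides by after resume

reported = 0.9177  # phase-2 loss sum / global_step (the deflated value)
corrected = reported * global_step / phase2_steps  # divide by steps actually trained

print(corrected)  # ≈ 1.1471, within ~3% of the 1.184 baseline
```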

What is still valid: MMLU

The MMLU evaluation is independent of the train_loss bug — it measures the final model directly.

Experiment    Config                          MMLU               vs Baseline
A (baseline)  GPU bf16, 500 steps straight    76.25% (±0.42%)
C             CPU fp32 100 → GPU bf16 400     76.34% (±0.42%)    +0.09%
G             CPU fp32 100 → GPU fp32 400     76.66% (±0.42%)    +0.41%

All warm-restart experiments (C, G, and one other) showed MMLU improvement over baseline — 4/4 comparisons in the positive direction. However:

  • All differences are within stderr (±0.42%). No individual comparison is statistically significant.
  • Single seed (42) throughout. Could be lucky.
  • Less than 1 epoch (500 steps out of ~6,500). Very limited training.
  • The lr schedule for Phase1 used an independent 100-step cosine, which accidentally created a warm restart (SGDR, Loshchilov & Hutter 2017). The improvement, if real, may simply be from the restart itself — not from CPU or precision.
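
The accidental restart described in the last bullet is easy to visualize: two independent cosine decays placed back to back make the learning rate snap back to its peak at the phase boundary, which is exactly the SGDR shape. A sketch, with base_lr as a placeholder and the 100/500 split from this thread:

```python
import math

def lr_at(step: int, base_lr: float = 2e-4,
          phase1: int = 100, total: int = 500) -> float:
    """Two independent cosine decays; lr snaps back to base_lr at step == phase1."""
    if step < phase1:
        t, span = step, phase1
    else:
        t, span = step - phase1, total - phase1
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t / span))

# lr_at(99) has almost fully decayed, but lr_at(100) is back at base_lr:
# the warm-restart shape of Loshchilov & Hutter's SGDR.
```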

What I got wrong

  • :cross_mark: “22% train_loss improvement” → measurement artifact
  • :cross_mark: “CPU finds deeper basins” → within noise; controlled experiments showed CPU = GPU
  • :cross_mark: “fp32 precision anchor” → bf16 = fp32 confirmed (Δ ≤ 0.005 across all tests)

What might be interesting

The consistent MMLU directionality (4/4 positive) under warm restart is suggestive of a small SGDR benefit during fine-tuning. But with n=1 seed and all values within stderr, this needs multi-seed verification before any claims can be made.

Conclusion

After extensive testing across 7B QLoRA 4-bit and 3B 16-bit full:

CPU = GPU = fp32 = bf16, all within noise (Δ ≤ 0.005).

The original hypothesis (CPU deterministic anchor improves training) is not supported by the data. Sorry for the premature claim — I should have caught the train_loss artifact before posting.

2 Likes

Comment to KK1kk1:

Thanks for the detailed correction — the clarification about the train_loss artifact makes perfect sense. One thing I want to highlight, though, is that your update doesn’t invalidate the broader behavioral pattern you observed earlier. The loss bug only affects the magnitude of the improvement, not the domain‑specific effects.

In my own math‑ratio experiments (multi‑seed, bootstrap, mixed‑effects models), I consistently see the same trend: increasing the proportion of math data improves math‑task accuracy but also increases instability and hallucinations on general tasks. Mixed ratios (around 25–50%) tend to be the most stable across seeds. This aligns with the directionality you saw in your warm‑restart runs, even if the absolute numbers were distorted by the HF reporting issue.

So while the “22% improvement” is gone, the underlying observation — that certain data distributions can push small models into less stable generalization regimes — still appears to hold. Your correction just removes the misleading part, not the interesting part.

1 Like

This is a fascinating observation! The phenomenon could be related to: 1) Learning rate warmup effects, 2) Dataset ordering bias, 3) Early overfitting prevention. Have you tried comparing final loss vs validation accuracy?

1 Like

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.