Hey everyone! 
I just published Part 1 of a three-part investigation into numerical divergence in hybrid models (attention + linear RNN architectures like OLMo Hybrid 7B).
The question: Why does the popular FP32 LM head fix only partially solve the training/inference KL divergence problem in hybrid models?
The finding: At 1,000 tokens, combining an FP32 GDN with an FP32 LM head recovers ~40% of the divergence vs the BF16 baseline. The surprising part: the LM head and the GDN recurrent state contribute roughly equally and independently (~23-25% each). You need both fixes, not just one.
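For anyone who wants to reproduce the measurement, here's roughly how such a TF-vs-AR KL harness might look. This is a hypothetical sketch, not the post's actual code: `model`, the HF-style `.logits` output, and the greedy rollout are all assumptions. The key idea is that the AR pass feeds back its own greedy tokens, so in exact arithmetic a TF re-scoring of that same sequence would agree perfectly, and any KL you measure is pure numerical divergence between the two compute paths.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def tf_vs_ar_kl(model, prompt_ids, gen_len=1000):
    """Per-token KL between teacher-forced (TF) and autoregressive (AR)
    logits over the same greedy continuation. [B, T] prompt_ids in,
    scalar mean KL out. No KV cache; this is a clarity-first sketch."""
    ids = prompt_ids.clone()
    ar_logits = []
    # 1) AR rollout: the model consumes its own greedy samples.
    for _ in range(gen_len):
        step = model(ids).logits[:, -1, :]                  # [B, V]
        ar_logits.append(step)
        ids = torch.cat([ids, step.argmax(-1, keepdim=True)], dim=-1)
    ar_logits = torch.stack(ar_logits, dim=1)               # [B, gen_len, V]

    # 2) TF pass: one forward over the full rolled-out sequence.
    #    Positions T0-1 .. end-1 predict exactly the AR-generated tokens.
    tf_logits = model(ids).logits[:, prompt_ids.shape[1] - 1 : -1, :]

    # 3) KL(TF || AR) per position, averaged. Upcast before softmax so
    #    the metric itself doesn't add BF16 noise.
    kl = F.kl_div(
        F.log_softmax(ar_logits.float(), dim=-1),
        F.log_softmax(tf_logits.float(), dim=-1),
        log_target=True,
        reduction="none",
    ).sum(-1)                                               # [B, gen_len]
    return kl.mean()
```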
Coming next:

- Part 1B: Does precision matching between autoregressive (AR) and teacher-forced (TF) passes matter? Preliminary results suggest GDN precision matching has a surprisingly large effect (26.6% KL reduction); a toy sketch of why the state dtype matters so much follows this list.
- Part 2: Kernel fusion (torch.compile) as a separate divergence source.
- Part 3: vLLM Triton kernel effects.
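To make the Part 1B intuition concrete ahead of the full writeup, here is a toy (emphatically not the real GDN kernel) showing why the dtype of the recurrent state is so sensitive: a gated linear recurrence folds thousands of multiply-adds into one state tensor, so BF16 rounding compounds at every step while FP32 accumulation stays close to a float64 reference.

```python
import torch

def gated_linear_scan(k, v, g, state_dtype=torch.float32):
    """Toy gated linear recurrence (GDN-flavoured, illustrative only):
        S_t = g_t * S_{t-1} + k_t v_t^T
    Inputs stay BF16 as in production; only the running state's dtype
    varies, isolating the accumulation-precision effect."""
    T, d = k.shape
    S = torch.zeros(d, d, dtype=state_dtype)
    readout = []
    for t in range(T):
        kt = k[t].to(state_dtype)
        vt = v[t].to(state_dtype)
        # Decay the state, then add the rank-1 update k_t v_t^T.
        S = g[t].to(state_dtype) * S + kt.unsqueeze(1) * vt.unsqueeze(0)
        readout.append(S.sum())  # crude readout, enough to track drift
    return torch.stack(readout)

torch.manual_seed(0)
T, d = 1000, 64
k = torch.randn(T, d, dtype=torch.bfloat16)
v = torch.randn(T, d, dtype=torch.bfloat16)
g = torch.rand(T, dtype=torch.bfloat16)  # decay gates in (0, 1)

ref = gated_linear_scan(k, v, g, state_dtype=torch.float64)
for dt in (torch.bfloat16, torch.float32):
    out = gated_linear_scan(k, v, g, state_dtype=dt)
    print(dt, "max drift vs fp64:", (out.double() - ref).abs().max().item())
```

If the AR engine carries the state in BF16 while the TF pass carries it in FP32, the two trajectories drift apart in exactly this compounding way, which is what precision matching is meant to remove.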
Full writeup here: 
Compute contributions welcome; I'm currently fighting Colab for a stable A100 instance.
UPDATE: Part B is now live! 
If Part A was about finding the “Horsemen,” Part B is about realizing our yardstick might be broken. I scaled the study to 8 prompts and shifted the focus to the production reality: BF16 Autoregressive (AR) rollouts.
The TL;DR on Part B:

- The “Low-Res” Filter: Using BF16 for rollouts is essentially passing a low-resolution filter over your target. Precision lost in the early recurrent layers of the GDN cannot be fully recovered by throwing FP32 at the teacher-forcing (TF) step alone.
- Matched or Higher Precision Is King: I tested the “MiniMax Ambiguity.” Upcasting the LM head only during the TF step is significantly less effective than upcasting it in both the rollout and the training pass; you have to bake the precision into the inference engine, not just the trainer. A sketch of such a wrapper follows this list.
- The 45% Ceiling: Even with the “best” fixes, we only recover about 45% of the total KL divergence. There is no single silver bullet here; divergence is a distributed problem.
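To illustrate the second bullet, here is one hypothetical way to bake an FP32 LM head into a PyTorch model. The class name and the monkey-patch are my assumptions, not the post's code or any engine's API; a real inference engine such as vLLM needs its own integration point. The operative detail is that the identical wrapper must run on both sides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FP32Head(nn.Module):
    """Wrap an existing LM head so its projection runs in FP32.
    Hypothetical sketch, not a library API."""

    def __init__(self, head: nn.Linear):
        super().__init__()
        self.head = head  # keep the original (possibly BF16) parameters

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Upcast activations and weights per call, project in FP32, and
        # return FP32 logits so the softmax also runs at full precision.
        # Gradients still flow through .float(), so this works in the
        # trainer as well as the inference engine.
        w = self.head.weight.float()
        b = self.head.bias.float() if self.head.bias is not None else None
        return F.linear(hidden.float(), w, b)

# Must be applied on BOTH sides, e.g. (names hypothetical):
#   trainer_model.lm_head = FP32Head(trainer_model.lm_head)  # TF pass
#   engine_model.lm_head  = FP32Head(engine_model.lm_head)   # AR rollout
```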
I’ve updated the original post with the heatmap and the sequence-length scaling charts. Check out the “Part B” section for the full breakdown of why your RL rewards might be flatlining due to “blurry” targets.
Still fighting for A100s, but the data doesn’t lie! 