Reward Model reward value range

When training an RM we use -log(sigmoid(R_chosen - R_rejected)). Nothing in that loss constrains the range of the rewards. How do we deal with that? For example, I have a case where my chosen reward is 0.22 and the rejected one is -1.24. Are there ways to handle this at the training stage? Or what are the post-processing options? I have seen an approach with R_min and R_max, where we rescale rewards with respect to these empirically sampled values, but that does not seem like a promising solution.

hmm…


Treat the BT reward’s absolute scale as non-identified. Only differences matter. Fix offset by centering. Fix scale by calibration or by pairing it with the PPO/GRPO KL weight β. Prefer z-score or “whitened” rewards over min–max. (Hugging Face)

Background, quickly

  • BT loss: (-\log \sigma(r^+ - r^-)). Shifting all rewards by a constant changes nothing. Scaling rewards changes the implied choice probabilities, so the scale must be fixed by you. (Hugging Face)
  • In your example, (r^+=0.22), (r^-=-1.24). Margin (=1.46). Preference prob (=\sigma(1.46)\approx0.812). The individual values 0.22 and −1.24 are arbitrary until you fix an offset and scale; only the 1.46 matters for the loss. A quick numeric check follows below. (Hugging Face)
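
A quick check with plain PyTorch (nothing model-specific is assumed here):

import torch

r_plus, r_minus = 0.22, -1.24
margin = torch.tensor(r_plus - r_minus)            # 1.46

p_prefer = torch.sigmoid(margin)                   # ≈ 0.812, the BT preference probability
loss = -torch.nn.functional.logsigmoid(margin)     # ≈ 0.21, the per-pair RM loss

# Shifting both rewards by any constant leaves the margin, hence the loss, unchanged.
shifted = torch.tensor((r_plus + 100.0) - (r_minus + 100.0))
assert torch.isclose(torch.sigmoid(shifted), p_prefer)

# Scaling does change the implied probability, which is why the scale must be pinned down.
print(torch.sigmoid(2 * margin))                   # ≈ 0.949, a different preference model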

Training-stage controls

  1. Mean-centering the RM.
    Use the auxiliary centering loss so rewards sit around 0 on a reference distribution. TRL exposes center_rewards_coefficient and recommends ≈1e-2. This removes the offset ambiguity and stabilizes training; a standalone loss sketch follows this list. (Hugging Face)

  2. Temperature calibration on a held-out preference set.
    Fit a single scalar (T>0) that best predicts preferences via (\sigma((r^+ - r^-)/T)). Then use (r_{\text{cal}}=r/T). This preserves ordering and puts margins on a calibrated logit scale. Temperature scaling is the standard one-parameter post-hoc calibration. (Proceedings of Machine Learning Research)

  3. Margin-aware losses if you have graded preferences.
    When raters give strengths or ratings, add a margin in the BT loss. Llama-2 shows accuracy gains from a rating-based margin in the helpfulness RM. This indirectly anchors scale. (arXiv)

  4. Regularize or ensemble the reward head.
    Ensembles reduce underspecification and over-optimization sensitivity. DeepMind shows reward models are underspecified and ensembles mitigate but do not eliminate reward hacking. (arXiv)
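
A minimal sketch combining items 1 and 3 above into one loss, assuming per-example scalar rewards r_chosen / r_rejected from the RM head and an optional rating-gap margin tensor; the centering term follows the same idea as TRL's center_rewards_coefficient (penalize rewards drifting away from zero) without claiming to match TRL's exact implementation:

import torch
import torch.nn.functional as F

def bt_loss(r_chosen, r_rejected, margin=None, center_coeff=1e-2):
    """Bradley-Terry loss with an optional rating margin and a mean-centering penalty.

    r_chosen, r_rejected: (batch,) scalar rewards from the RM head.
    margin: optional (batch,) non-negative preference strength (Llama-2 style).
    center_coeff: weight of the auxiliary term that keeps rewards near zero.
    """
    diff = r_chosen - r_rejected
    if margin is not None:
        diff = diff - margin                       # require chosen to win by at least `margin`
    loss = -F.logsigmoid(diff).mean()
    # Auxiliary centering: discourage the reward distribution from drifting off zero.
    loss = loss + center_coeff * (r_chosen + r_rejected).pow(2).mean()
    return loss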

Post-processing options that actually help

  • Z-score / whitening.
    Compute (r'=(r-\mu)/\sigma) using stable running stats or a fixed reference set. TRL offers whiten_rewards, use_score_scaling, use_score_norm, and score_clip so PPO computes advantages on normalized rewards. Prefer this over min–max, which is prompt-difficulty dependent; a running-stats sketch follows this list. (Hugging Face)

  • Temperature scaling (one parameter).
    Learn (T) once, divide all rewards by (T), and keep it fixed during RL or reranking. Simple, monotone, and robust. (Proceedings of Machine Learning Research)

  • Avoid per-batch min–max.
    It injects variance and makes scores incomparable across prompts or time. Use running mean and std or TRL’s built-ins instead. TRL exposes the proper knobs; see docs. (Hugging Face)
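
If you normalize outside TRL, a small running-statistics helper (hypothetical code, not a TRL class) does the same job as whiten_rewards while keeping its statistics across batches, so scores stay comparable over time:

import torch

class RunningRewardNorm:
    """Z-score raw RM scores against a running mean/std accumulated over all batches."""

    def __init__(self, eps: float = 1e-8):
        self.count, self.mean, self.m2 = eps, 0.0, 0.0

    def update(self, rewards: torch.Tensor) -> None:
        # Chan/Welford parallel update of the mean and the sum of squared deviations (m2).
        b_count = rewards.numel()
        b_mean = rewards.mean().item()
        b_var = rewards.var(unbiased=False).item()
        delta = b_mean - self.mean
        total = self.count + b_count
        self.mean += delta * b_count / total
        self.m2 += b_var * b_count + delta ** 2 * self.count * b_count / total
        self.count = total

    def __call__(self, rewards: torch.Tensor) -> torch.Tensor:
        self.update(rewards)
        std = (self.m2 / self.count) ** 0.5
        return (rewards - self.mean) / (std + 1e-8)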

RL stage: match reward scale to KL strength

In KL-regularized RLHF the policy optimizes (r_\theta(x,y)-\beta\,\mathrm{KL}(\pi\,\|\,\pi_{\text{ref}})). If you scale rewards by a factor (a), scale (\beta) by the same factor, since only the ratio of reward scale to KL weight matters, or enable adaptive KL so the empirical KL hits a target. TRL has adap_kl_ctrl, init_kl_coef, and target. (Hugging Face)
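
A tiny sanity check of that coupling: scaling the reward and β by the same factor rescales the whole objective by a positive constant, so the optimal policy is unchanged.

reward, kl, beta, a = 1.46, 0.8, 0.2, 10.0

base     = reward - beta * kl              # original per-sample objective
rescaled = a * reward - (a * beta) * kl    # rewards scaled by a, beta scaled by a
assert abs(rescaled - a * base) < 1e-9     # identical up to the positive factor a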

What to do with your numbers (0.22 vs −1.24)

  • Center the RM during training (center_rewards_coefficient≈1e-2).
  • Calibrate a temperature (T) on held-out pairs. Replace (r\leftarrow r/T).
  • Normalize in RL using whiten_rewards or use_score_scaling and set adaptive KL.
    This keeps your ranking intact and makes the magnitude usable by PPO/GRPO/RLOO. (Hugging Face)

Drop-in configs and code

TRL: center the reward and log margins

# docs: https://huggingface.co/docs/trl/reward_trainer
from trl import RewardTrainer, RewardConfig  # TRL main docs

cfg = RewardConfig(
    learning_rate=5e-7,
    center_rewards_coefficient=1e-2,  # see TRL docs
    log_on_each_node=False,
    per_device_train_batch_size=4,
)
# trainer = RewardTrainer(model=..., train_dataset=..., args=cfg)
# trainer.train()
# source: TRL RewardTrainer docs

(Hugging Face)

One-parameter temperature calibration for BT margins

# reference: Guo et al. 2017 (Temperature scaling)
# https://proceedings.mlr.press/v70/guo17a/guo17a.pdf
import torch, torch.nn.functional as F

margins = torch.tensor(margins_float_list)   # r_plus - r_minus on held-out
labels  = torch.tensor(labels_int_list).float()  # 1 if plus preferred

T = torch.nn.Parameter(torch.tensor(1.0))
opt = torch.optim.LBFGS([T], lr=0.1, max_iter=100)

def closure():
    opt.zero_grad()
    p = torch.sigmoid(margins / torch.clamp(T, min=1e-3))
    loss = F.binary_cross_entropy(p, labels)
    loss.backward()
    return loss

opt.step(closure)
T_star = float(T.detach())  # use r_cal = r / T_star downstream

(Proceedings of Machine Learning Research)

TRL PPO: normalize rewards and control KL

# docs: https://huggingface.co/docs/trl/main/trainer
from trl import PPOConfig
ppo = PPOConfig(
    adap_kl_ctrl=True,      # adaptive KL controller
    init_kl_coef=0.2,       # starting beta
    target=6.0,             # target KL
    whiten_rewards=True,    # z-score rewards before advantages
    use_score_scaling=True, # optional scaling
    use_score_norm=True,    # normalize scaled scores
)
# source: TRL Trainer docs (fields: adap_kl_ctrl, target, whiten_rewards, use_score_scaling, use_score_norm)

(Hugging Face)

Alternatives that avoid explicit RM scale

  • DPO and family. Optimize preferences directly. You tune β in the loss rather than shipping a separate RM with an arbitrary scale. Good when PPO instability or reward scale headaches dominate. (arXiv)
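
For reference, DPO applies the same BT objective to implicit rewards β·(log π − log π_ref), so the scale question collapses into choosing β. A minimal sketch, assuming summed response log-probs under the policy and the frozen reference model are already available:

import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: Bradley-Terry loss on implicit rewards beta * (log pi - log pi_ref).

    All inputs are (batch,) summed log-probabilities of full responses.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()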

Common failure patterns tied to scale and fixes

  • RM outputs collapse or look constant. Add the centering loss and verify data balance; this comes up in a known TRL discussion. (GitHub)
  • KL explodes or collapses. Usually a mismatch between reward scale and the KL weight. Use adaptive KL (see the controller sketch after this list) or retune β. A TRL issue tracks this explicitly. (GitHub)
  • Over-optimization to a single RM. Use ensembles or worst-case aggregation. Still not a silver bullet. (arXiv)
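
The controller behind adaptive KL is a simple proportional update; this sketch follows Ziegler et al. 2019, and TRL's adap_kl_ctrl implements roughly the same recipe:

class AdaptiveKLController:
    """Nudge beta so the measured KL tracks a target (proportional control)."""

    def __init__(self, init_kl_coef: float = 0.2, target: float = 6.0, horizon: int = 10000):
        self.value = init_kl_coef      # current beta
        self.target = target           # desired KL between policy and reference
        self.horizon = horizon         # smoothing horizon in generated samples

    def update(self, current_kl: float, n_steps: int) -> float:
        # Raise beta when KL overshoots the target, lower it when KL undershoots.
        proportional_error = max(min(current_kl / self.target - 1.0, 0.2), -0.2)
        self.value *= 1.0 + proportional_error * n_steps / self.horizon
        return self.value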

Minimal, durable recipe

  1. Train the RM with centering. Log margins. (Hugging Face)
  2. Fit a temperature on held-out comparisons. Freeze it. (Proceedings of Machine Learning Research)
  3. During PPO/GRPO/RLOO, whiten rewards and use adaptive KL. Target a fixed KL band. (Hugging Face)
  4. If you have strength labels, add a margin term. (arXiv)
  5. For high-stakes runs, ensemble reward heads. (arXiv)

Short, curated references

GitHub issues/discussions

  • KL–reward scaling mismatch and β control. (GitHub)
  • RewardTrainer producing constant scores and remedies. (GitHub)

Hugging Face TRL docs

  • RewardTrainer and centering, with rationale. (Hugging Face)
  • PPO/Trainer options: whiten_rewards, use_score_scaling, use_score_norm, adaptive KL. (Hugging Face)

Papers and surveys

  • Learning to Summarize with KL-regularized RLHF. Basis for the reward–β coupling. (NeurIPS Proceedings)
  • Llama-2: margin in the helpfulness RM ablation. (arXiv)
  • RewardBench: evaluating reward models. Useful for sanity checks after calibration. (arXiv)
  • Reward ensembles mitigate, but do not eliminate, over-optimization. (arXiv)
  • Temperature scaling for calibration. (Proceedings of Machine Learning Research)

Bottom line: stop chasing absolute reward values. Center the RM. Calibrate one temperature. Normalize at RL time and tune β. Use margins if you have strength labels. Prefer z-score over min–max. If scale handling is painful, use DPO. (Hugging Face)