Reward Model reward value range

When training an RM we use -log(sigmoid(R_chosen - R_rejected)). Nothing in that loss constrains the range of the rewards. How do we deal with that? For example, I have a case where my chosen reward is 0.22 and the rejected one is -1.24. Are there ways to handle this at the training stage? Or what are the post-processing options? I have seen an approach with R_min and R_max, where we rescale rewards with respect to these empirically sampled values, but that does not seem like a promising solution.

hmm…


Treat the BT reward’s absolute scale as non-identified. Only differences matter. Fix offset by centering. Fix scale by calibration or by pairing it with the PPO/GRPO KL weight β. Prefer z-score or “whitened” rewards over min–max. (Hugging Face)

Background, quickly

  • BT loss: (-\log \sigma(r^+ - r^-)). Shifting all rewards by a constant changes nothing. Scaling rewards changes the implied choice probabilities, so the scale must be fixed by you. (Hugging Face)
  • In your example, (r^+=0.22), (r^-=-1.24). Margin (=1.46). Preference prob (=\sigma(1.46)\approx0.812). The individual values 0.22 and −1.24 are arbitrary until you fix an offset and scale; only the 1.46 matters for the loss. A quick numeric check follows below. (Hugging Face)
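
A quick check with plain PyTorch (nothing model-specific is assumed here):

import torch

r_plus, r_minus = 0.22, -1.24
margin = torch.tensor(r_plus - r_minus)            # 1.46

p_prefer = torch.sigmoid(margin)                   # ≈ 0.812, the BT preference probability
loss = -torch.nn.functional.logsigmoid(margin)     # ≈ 0.21, the per-pair RM loss

# Shifting both rewards by any constant leaves the margin, hence the loss, unchanged.
shifted = torch.tensor((r_plus + 100.0) - (r_minus + 100.0))
assert torch.isclose(torch.sigmoid(shifted), p_prefer)

# Scaling does change the implied probability, which is why the scale must be pinned down.
print(torch.sigmoid(2 * margin))                   # ≈ 0.949, a different preference model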

Training-stage controls

  1. Mean-centering the RM.
    Use the auxiliary centering loss so rewards sit around 0 on a reference distribution. TRL exposes center_rewards_coefficient and recommends ≈1e-2. This removes the offset ambiguity and stabilizes training; a standalone loss sketch follows this list. (Hugging Face)

  2. Temperature calibration on a held-out preference set.
    Fit a single scalar (T>0) that best predicts preferences via (\sigma((r^+ - r^-)/T)). Then use (r_{\text{cal}}=r/T). This preserves ordering and puts margins on a calibrated logit scale. Temperature scaling is the standard one-parameter post-hoc calibration. (Proceedings of Machine Learning Research)

  3. Margin-aware losses if you have graded preferences.
    When raters give strengths or ratings, add a margin in the BT loss. Llama-2 shows accuracy gains from a rating-based margin in the helpfulness RM. This indirectly anchors scale. (arXiv)

  4. Regularize or ensemble the reward head.
    Ensembles reduce underspecification and over-optimization sensitivity. DeepMind shows reward models are underspecified and ensembles mitigate but do not eliminate reward hacking. (arXiv)
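
A minimal sketch combining items 1 and 3 above into one loss, assuming per-example scalar rewards r_chosen / r_rejected from the RM head and an optional rating-gap margin tensor; the centering term follows the same idea as TRL's center_rewards_coefficient (penalize rewards drifting away from zero) without claiming to match TRL's exact implementation:

import torch
import torch.nn.functional as F

def bt_loss(r_chosen, r_rejected, margin=None, center_coeff=1e-2):
    """Bradley-Terry loss with an optional rating margin and a mean-centering penalty.

    r_chosen, r_rejected: (batch,) scalar rewards from the RM head.
    margin: optional (batch,) non-negative preference strength (Llama-2 style).
    center_coeff: weight of the auxiliary term that keeps rewards near zero.
    """
    diff = r_chosen - r_rejected
    if margin is not None:
        diff = diff - margin                       # require chosen to win by at least `margin`
    loss = -F.logsigmoid(diff).mean()
    # Auxiliary centering: discourage the reward distribution from drifting off zero.
    loss = loss + center_coeff * (r_chosen + r_rejected).pow(2).mean()
    return loss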

Post-processing options that actually help

  • Z-score / whitening.
    Compute (r'=(r-\mu)/\sigma) using stable running stats or a fixed reference set. TRL offers whiten_rewards, use_score_scaling, use_score_norm, and score_clip so PPO computes advantages on normalized rewards. Prefer this over min–max, which is prompt-difficulty dependent; a running-stats sketch follows this list. (Hugging Face)

  • Temperature scaling (one parameter).
    Learn (T) once, divide all rewards by (T), and keep it fixed during RL or reranking. Simple, monotone, and robust. (Proceedings of Machine Learning Research)

  • Avoid per-batch min–max.
    It injects variance and makes scores incomparable across prompts or time. Use running mean and std or TRL’s built-ins instead. TRL exposes the proper knobs; see docs. (Hugging Face)
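
If you normalize outside TRL, a small running-statistics helper (hypothetical code, not a TRL class) does the same job as whiten_rewards while keeping its statistics across batches, so scores stay comparable over time:

import torch

class RunningRewardNorm:
    """Z-score raw RM scores against a running mean/std accumulated over all batches."""

    def __init__(self, eps: float = 1e-8):
        self.count, self.mean, self.m2 = eps, 0.0, 0.0

    def update(self, rewards: torch.Tensor) -> None:
        # Chan/Welford parallel update of the mean and the sum of squared deviations (m2).
        b_count = rewards.numel()
        b_mean = rewards.mean().item()
        b_var = rewards.var(unbiased=False).item()
        delta = b_mean - self.mean
        total = self.count + b_count
        self.mean += delta * b_count / total
        self.m2 += b_var * b_count + delta ** 2 * self.count * b_count / total
        self.count = total

    def __call__(self, rewards: torch.Tensor) -> torch.Tensor:
        self.update(rewards)
        std = (self.m2 / self.count) ** 0.5
        return (rewards - self.mean) / (std + 1e-8)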

RL stage: match reward scale to KL strength

In KL-regularized RLHF the policy optimizes (r_\theta(x,y)-\beta\,\mathrm{KL}(\pi\,\|\,\pi_{\text{ref}})). If you scale rewards by a factor (a), scale (\beta) by the same factor, since only the ratio of reward scale to KL weight matters, or enable adaptive KL so the empirical KL hits a target. TRL has adap_kl_ctrl, init_kl_coef, and target. (Hugging Face)
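
A tiny sanity check of that coupling: scaling the reward and β by the same factor rescales the whole objective by a positive constant, so the optimal policy is unchanged.

reward, kl, beta, a = 1.46, 0.8, 0.2, 10.0

base     = reward - beta * kl              # original per-sample objective
rescaled = a * reward - (a * beta) * kl    # rewards scaled by a, beta scaled by a
assert abs(rescaled - a * base) < 1e-9     # identical up to the positive factor a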

What to do with your numbers (0.22 vs −1.24)

  • Center the RM during training (center_rewards_coefficient≈1e-2).
  • Calibrate a temperature (T) on held-out pairs. Replace (r\leftarrow r/T).
  • Normalize in RL using whiten_rewards or use_score_scaling and set adaptive KL.
    This keeps your ranking intact and makes the magnitude usable by PPO/GRPO/RLOO. (Hugging Face)

Drop-in configs and code

TRL: center the reward and log margins

# docs: https://huggingface.co/docs/trl/reward_trainer
from trl import RewardTrainer, RewardConfig  # TRL main docs

cfg = RewardConfig(
    learning_rate=5e-7,
    center_rewards_coefficient=1e-2,  # see TRL docs
    log_on_each_node=False,
    per_device_train_batch_size=4,
)
# trainer = RewardTrainer(model=..., train_dataset=..., args=cfg)
# trainer.train()
# source: TRL RewardTrainer docs

(Hugging Face)

One-parameter temperature calibration for BT margins

# reference: Guo et al. 2017 (Temperature scaling)
# https://proceedings.mlr.press/v70/guo17a/guo17a.pdf
import torch, torch.nn.functional as F

margins = torch.tensor(margins_float_list)   # r_plus - r_minus on held-out
labels  = torch.tensor(labels_int_list).float()  # 1 if plus preferred

T = torch.nn.Parameter(torch.tensor(1.0))
opt = torch.optim.LBFGS([T], lr=0.1, max_iter=100)

def closure():
    opt.zero_grad()
    p = torch.sigmoid(margins / torch.clamp(T, min=1e-3))
    loss = F.binary_cross_entropy(p, labels)
    loss.backward()
    return loss

opt.step(closure)
T_star = float(T.detach())  # use r_cal = r / T_star downstream

(Proceedings of Machine Learning Research)

TRL PPO: normalize rewards and control KL

# docs: https://huggingface.co/docs/trl/main/trainer
from trl import PPOConfig
ppo = PPOConfig(
    adap_kl_ctrl=True,      # adaptive KL controller
    init_kl_coef=0.2,       # starting beta
    target=6.0,             # target KL
    whiten_rewards=True,    # z-score rewards before advantages
    use_score_scaling=True, # optional scaling
    use_score_norm=True,    # normalize scaled scores
)
# source: TRL Trainer docs (fields: adap_kl_ctrl, target, whiten_rewards, use_score_scaling, use_score_norm)

(Hugging Face)

Alternatives that avoid explicit RM scale

  • DPO and family. Optimize preferences directly. You tune β in the loss rather than shipping a separate RM with an arbitrary scale. Good when PPO instability or reward scale headaches dominate. (arXiv)
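
For reference, DPO applies the same BT objective to implicit rewards β·(log π − log π_ref), so the scale question collapses into choosing β. A minimal sketch, assuming summed response log-probs under the policy and the frozen reference model are already available:

import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: Bradley-Terry loss on implicit rewards beta * (log pi - log pi_ref).

    All inputs are (batch,) summed log-probabilities of full responses.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()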

Common failure patterns tied to scale and fixes

  • RM outputs collapse or look constant. Add the centering loss and verify data balance; this comes up in a known TRL discussion. (GitHub)
  • KL explodes or collapses. Usually a mismatch between reward scale and the KL weight. Use adaptive KL (see the controller sketch after this list) or retune β. A TRL issue tracks this explicitly. (GitHub)
  • Over-optimization to a single RM. Use ensembles or worst-case aggregation. Still not a silver bullet. (arXiv)
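
The controller behind adaptive KL is a simple proportional update; this sketch follows Ziegler et al. 2019, and TRL's adap_kl_ctrl implements roughly the same recipe:

class AdaptiveKLController:
    """Nudge beta so the measured KL tracks a target (proportional control)."""

    def __init__(self, init_kl_coef: float = 0.2, target: float = 6.0, horizon: int = 10000):
        self.value = init_kl_coef      # current beta
        self.target = target           # desired KL between policy and reference
        self.horizon = horizon         # smoothing horizon in generated samples

    def update(self, current_kl: float, n_steps: int) -> float:
        # Raise beta when KL overshoots the target, lower it when KL undershoots.
        proportional_error = max(min(current_kl / self.target - 1.0, 0.2), -0.2)
        self.value *= 1.0 + proportional_error * n_steps / self.horizon
        return self.value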

Minimal, durable recipe

  1. Train the RM with centering. Log margins. (Hugging Face)
  2. Fit a temperature on held-out comparisons. Freeze it. (Proceedings of Machine Learning Research)
  3. During PPO/GRPO/RLOO, whiten rewards and use adaptive KL. Target a fixed KL band. (Hugging Face)
  4. If you have strength labels, add a margin term. (arXiv)
  5. For high-stakes runs, ensemble reward heads. (arXiv)

Short, curated references

GitHub issues/discussions

  • KL–reward scaling mismatch and β control. (GitHub)
  • RewardTrainer producing constant scores and remedies. (GitHub)

Hugging Face TRL docs

  • RewardTrainer and centering, with rationale. (Hugging Face)
  • PPO/Trainer options: whiten_rewards, use_score_scaling, use_score_norm, adaptive KL. (Hugging Face)

Papers and surveys

  • Learning to Summarize with KL-regularized RLHF. Basis for the reward–β coupling. (NeurIPS Proceedings)
  • Llama-2: margin in the helpfulness RM ablation. (arXiv)
  • RewardBench: evaluating reward models. Useful for sanity checks after calibration. (arXiv)
  • Reward ensembles mitigate, but do not eliminate, over-optimization. (arXiv)
  • Temperature scaling for calibration. (Proceedings of Machine Learning Research)

Bottom line: stop chasing absolute reward values. Center the RM. Calibrate one temperature. Normalize at RL time and tune β. Use margins if you have strength labels. Prefer z-score over min–max. If scale handling is painful, use DPO. (Hugging Face)