When training an RM we use -log(sigmoid(R_chosen - R_rejected)). Nothing in this loss constrains the range of the rewards. How do we deal with that? Concretely, I have a case where my chosen reward is 0.22 and the rejected one is -1.24. Are there ways to handle this at the training stage, and what are the post-processing options? I have seen an approach with R_min and R_max, where we rescale rewards with respect to these empirically sampled values, but that does not seem like a promising solution.
Treat the BT reward's absolute scale as non-identified. Only differences matter. Fix the offset by centering. Fix the scale by calibration or by pairing it with the PPO/GRPO KL weight β. Prefer z-scored or "whitened" rewards over min–max. (Hugging Face)
Background, quickly
- BT loss: (-\log \sigma(r^+ - r^-)). Shifting all rewards by a constant changes nothing. Scaling rewards changes choice probabilities, so the scale must be fixed by you. (Hugging Face)
- In your example, (r^+=0.22), (r^-=-1.24). Margin (=1.46). Preference probability (\sigma(1.46)\approx0.812). The values 0.22 and -1.24 are arbitrary without a chosen offset and scale; only the 1.46 margin matters for the loss (a quick numeric check follows). (Hugging Face)
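A quick numeric check of both properties, using the numbers above (a minimal sketch; torch is only used for the sigmoid):
import torch

r_plus, r_minus = 0.22, -1.24                    # chosen / rejected rewards from the example
margin = r_plus - r_minus                        # 1.46
p = torch.sigmoid(torch.tensor(margin))          # ~0.812, the BT preference probability

# Shifting both rewards by the same constant leaves the probability unchanged (offset not identified)
p_shift = torch.sigmoid(torch.tensor((r_plus + 5.0) - (r_minus + 5.0)))   # same ~0.812

# Scaling both rewards changes the probability, so the scale has to be pinned down
p_scale = torch.sigmoid(torch.tensor(2.0 * r_plus - 2.0 * r_minus))       # ~0.949
print(p.item(), p_shift.item(), p_scale.item())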
Training-stage controls
- Mean-centering the RM. Use the auxiliary centering loss so rewards sit around 0 on a reference distribution. TRL exposes center_rewards_coefficient and recommends ≈1e-2. This removes the offset ambiguity and stabilizes training. (Hugging Face)
- Temperature calibration on a held-out preference set. Fit a single scalar (T>0) that best predicts preferences via (\sigma((r^+ - r^-)/T)), then use (r_{\text{cal}}=r/T). This preserves ordering and puts margins on a calibrated logit scale. Temperature scaling is the standard one-parameter post-hoc calibration. (Proceedings of Machine Learning Research)
- Margin-aware losses if you have graded preferences. When raters give strengths or ratings, add a margin to the BT loss; Llama-2 shows accuracy gains from a rating-based margin in the helpfulness RM. This indirectly anchors the scale (see the sketch after this list). (arXiv)
- Regularize or ensemble the reward head. Ensembles reduce underspecification and sensitivity to over-optimization. DeepMind shows reward models are underspecified and that ensembles mitigate, but do not eliminate, reward hacking. (arXiv)
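A minimal sketch of the margin-aware BT loss, assuming you can derive a per-pair margin from rating gaps (the tensors below are hypothetical placeholders):
import torch
import torch.nn.functional as F

def margin_bt_loss(r_chosen, r_rejected, margin):
    # Standard BT loss is -log sigmoid(r_chosen - r_rejected); the margin variant
    # requires a larger reward gap before the loss vanishes for strongly-rated pairs.
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Hypothetical batch: the second pair has a strong rating gap, so it gets a larger margin
r_chosen   = torch.tensor([0.22, 1.10])
r_rejected = torch.tensor([-1.24, 0.90])
margin     = torch.tensor([0.0, 1.0])
loss = margin_bt_loss(r_chosen, r_rejected, margin)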
Post-processing options that actually help
- Z-score / whitening. Compute (r' = (r-\mu)/\sigma) using stable running stats or a fixed reference set (see the sketch after this list). TRL offers whiten_rewards, use_score_scaling, use_score_norm, and score_clip so PPO computes advantages on normalized rewards. Prefer this to min–max, which is prompt-difficulty dependent. (Hugging Face)
- Temperature scaling (one parameter). Learn (T) once, divide all rewards by (T), and keep it fixed during RL or reranking. Simple, monotone, and robust. (Proceedings of Machine Learning Research)
- Avoid per-batch min–max. It injects variance and makes scores incomparable across prompts and over time. Use a running mean and std, or TRL's built-ins, instead; TRL exposes the proper knobs in its docs. (Hugging Face)
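If you want the z-scoring outside TRL, a minimal running-stats sketch (the class name and interface are hypothetical, not a TRL API):
import torch

class RunningRewardNorm:
    """Tracks a running mean/std of rewards and z-scores new batches (Welford update)."""
    def __init__(self, eps: float = 1e-8):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0
        self.eps = eps

    def update(self, rewards: torch.Tensor) -> None:
        for r in rewards.flatten().tolist():
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    def normalize(self, rewards: torch.Tensor) -> torch.Tensor:
        std = (self.m2 / max(self.count - 1, 1)) ** 0.5
        return (rewards - self.mean) / (std + self.eps)

norm = RunningRewardNorm()
batch = torch.tensor([0.22, -1.24, 0.05, -0.60])
norm.update(batch)
whitened = norm.normalize(batch)   # zero-mean, ~unit-std; ordering preserved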
RL stage: match reward scale to KL strength
In KL-regularized RLHF the policy optimizes (r_\theta(x,y)-\beta\,\mathrm{KL}(\pi \| \pi_{\text{ref}})). If you scale rewards by (a), scale (\beta) by the same factor (a), or enable adaptive KL so the empirical KL hits a target. TRL has adap_kl_ctrl, init_kl_coef, and target. Only the ratio of reward scale to KL strength matters, as the check below illustrates. (Hugging Face)
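A quick numeric check of that coupling (a minimal sketch; the reward, KL, and scale values are made up):
# Per-sequence objective: r - beta * KL
r, kl = 1.46, 4.0                       # hypothetical reward and KL(pi || pi_ref)
beta = 0.2
obj = r - beta * kl                     # 0.66

a = 10.0                                # rescale rewards by a ...
obj_scaled = a * r - (a * beta) * kl    # ... and beta by the same factor: 6.6 = 10 * 0.66
# The objective is just multiplied by a, so the optimum is unchanged.
# Only the ratio of reward scale to beta matters.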
What to do with your numbers (0.22 vs -1.24)
- Center the RM during training (center_rewards_coefficient ≈ 1e-2).
- Calibrate a temperature (T) on held-out pairs and replace (r\leftarrow r/T).
- Normalize in RL using whiten_rewards or use_score_scaling, and set adaptive KL.
This keeps your ranking intact and makes the magnitude usable by PPO/GRPO/RLOO. (Hugging Face)
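Applied to your two scores, a minimal sketch (the mean and temperature below are hypothetical placeholders; in practice they come from your reference set and the held-out calibration):
r_chosen, r_rejected = 0.22, -1.24

mu = -0.51         # hypothetical reference-set mean of the (centered) RM
T_star = 1.8       # hypothetical temperature fitted on held-out pairs

cal_chosen   = (r_chosen   - mu) / T_star   # ~0.41
cal_rejected = (r_rejected - mu) / T_star   # ~-0.41
# Ordering is preserved, the margin is rescaled consistently, and the values now sit on
# a calibrated, roughly zero-centered scale that PPO/GRPO/RLOO can use directly.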
Drop-in configs and code
TRL: center the reward and log margins
# docs: https://huggingface.co/docs/trl/reward_trainer
from trl import RewardTrainer, RewardConfig # TRL main docs
cfg = RewardConfig(
    learning_rate=5e-7,
    center_rewards_coefficient=1e-2,  # see TRL docs
    log_on_each_node=False,
    per_device_train_batch_size=4,
)
# trainer = RewardTrainer(model=..., train_dataset=..., args=cfg)
# trainer.train()
# source: TRL RewardTrainer docs
One-parameter temperature calibration for BT margins
# reference: Guo et al. 2017 (Temperature scaling)
# https://proceedings.mlr.press/v70/guo17a/guo17a.pdf
import torch
import torch.nn.functional as F

margins = torch.tensor(margins_float_list)      # r_plus - r_minus on held-out pairs
labels = torch.tensor(labels_int_list).float()  # 1 if the "plus" response is preferred

T = torch.nn.Parameter(torch.tensor(1.0))
opt = torch.optim.LBFGS([T], lr=0.1, max_iter=100)

def closure():
    opt.zero_grad()
    p = torch.sigmoid(margins / torch.clamp(T, min=1e-3))
    loss = F.binary_cross_entropy(p, labels)
    loss.backward()
    return loss

opt.step(closure)
T_star = float(T.detach())  # use r_cal = r / T_star downstream
(Proceedings of Machine Learning Research)
TRL PPO: normalize rewards and control KL
# docs: https://huggingface.co/docs/trl/main/trainer
from trl import PPOConfig
ppo = PPOConfig(
    adap_kl_ctrl=True,        # adaptive KL controller
    init_kl_coef=0.2,         # starting beta
    target=6.0,               # target KL
    whiten_rewards=True,      # z-score rewards before advantages
    use_score_scaling=True,   # optional scaling
    use_score_norm=True,      # normalize scaled scores
)
# source: TRL Trainer docs (fields: adap_kl_ctrl, target, whiten_rewards, use_score_scaling, use_score_norm)
Alternatives that avoid explicit RM scale
- DPO and family. Optimize preferences directly. You tune β in the loss rather than shipping a separate RM with an arbitrary scale. Good when PPO instability or reward-scale headaches dominate; a minimal config sketch follows. (arXiv)
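A minimal sketch of that route with TRL's DPOTrainer, assuming a preference dataset in the usual prompt/chosen/rejected format (dataset and model objects are placeholders):
# docs: https://huggingface.co/docs/trl/dpo_trainer
from trl import DPOConfig, DPOTrainer

dpo_cfg = DPOConfig(
    beta=0.1,                        # plays the KL-strength role; no separate RM scale to manage
    learning_rate=5e-7,
    per_device_train_batch_size=4,
)
# trainer = DPOTrainer(model=..., ref_model=..., args=dpo_cfg, train_dataset=...)
# trainer.train()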
Common failure patterns tied to scale and fixes
- RM outputs collapse or look constant. Add centering loss. Verify data balance. This is a known TRL discussion. (GitHub)
- KL explodes or collapses. Mismatch between reward scale and KL. Use adaptive KL or retune β. TRL issue tracks this explicitly. (GitHub)
- Over-optimization to a single RM. Use ensembles or worst-case aggregation (see the sketch after this list). Still not a silver bullet. (arXiv)
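A small aggregation sketch for an ensemble of reward heads (the scores are made up; mean and worst-case are the two common choices):
import torch

# Hypothetical per-head scores for one response from a 4-head reward ensemble
head_scores = torch.tensor([0.31, 0.18, 0.42, -0.05])

mean_reward  = head_scores.mean()       # smooths out head-specific quirks
worst_case   = head_scores.min()        # conservative; penalizes head disagreement
disagreement = head_scores.std()        # large values are an over-optimization warning sign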
Minimal, durable recipe
- Train the RM with centering. Log margins. (Hugging Face)
- Fit a temperature on held-out comparisons. Freeze it. (Proceedings of Machine Learning Research)
- During PPO/GRPO/RLOO, whiten rewards and use adaptive KL. Target a fixed KL band. (Hugging Face)
- If you have strength labels, add a margin term. (arXiv)
- For high-stakes runs, ensemble reward heads. (arXiv)
Short, curated references
GitHub issues/discussions
- KLâreward scaling mismatch and β control. (GitHub)
- RewardTrainer producing constant scores and remedies. (GitHub)
Hugging Face TRL docs
- RewardTrainer and centering, with rationale. (Hugging Face)
- PPO/Trainer options: whiten_rewards, use_score_scaling, use_score_norm, adaptive KL. (Hugging Face)
Papers and surveys
- Learning to Summarize with KL-regularized RLHF. Basis for rewardâβ coupling. (NeurIPS Proceedings)
- Llama-2: margin in the helpfulness RM ablation. (arXiv)
- RewardBench: evaluating reward models. Useful for sanity checks after calibration. (arXiv)
- Reward ensembles mitigate over-optimization, not eliminate. (arXiv)
- Temperature scaling for calibration. (Proceedings of Machine Learning Research)
Bottom line: stop chasing absolute reward values. Center the RM. Calibrate one temperature. Normalize at RL time and tune β. Use margins if you have strength labels. Prefer z-score over min–max. If scale handling is painful, use DPO. (Hugging Face)