Hi everyone,
Something I’ve been wondering about recently, and I’d value some input.
I’ve been working with the trl library to fine-tune various decoder-only LLMs via RLHF. During the PPO loop, I’ll collect the rewards using something like:
```python
raw_rewards = ppo_trainer.model.compute_reward_score(input_ids, attention_masks)
```
This returns the class logits from the previously trained AutoModelForSequenceClassification model, which we then need to turn into a scalar reward. I’ve seen different approaches to this, for example taking the first element of the logits (see here) or taking the last element (see here).
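For concreteness, here is roughly what those two conventions look like. The shape of raw_rewards is my assumption; I’m treating it as a (batch_size, 2) tensor of class logits:

```python
import torch

# Toy stand-in for the reward model output; assumed shape (batch_size, 2)
raw_rewards = torch.tensor([[ 1.3, -0.7],
                            [ 0.2,  0.9]])

# Convention 1: take the first element of the logits as the scalar reward
rewards_first = raw_rewards[:, 0]    # tensor([1.3000, 0.2000])

# Convention 2: take the last element of the logits as the scalar reward
rewards_last = raw_rewards[:, -1]    # tensor([-0.7000, 0.9000])
```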
Examining the code for the RewardTrainer, we can see how the loss function is constructed (line 238):
```python
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
```
If we denote the chosen and rejected reward logits by the tuples (c1, c2) and (r1, r2), then we can write the above as:

loss = -( logsigmoid(c1 - r1) + logsigmoid(c2 - r2) ) / 2

and see that the loss function forces weight updates that drive both elements of the reward tuple towards a high score (for a chosen input) or a low score (for a rejected input). (Hence, presumably, why I have seen code that takes either element.)
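As a quick sanity check on that reading (my own toy code, not anything from trl), the .mean() does indeed average the log-sigmoid margin over both logit elements:

```python
import torch
import torch.nn as nn

# Single pair of 2-element reward logits: chosen = (c1, c2), rejected = (r1, r2)
rewards_chosen   = torch.tensor([[2.0, 1.0]])
rewards_rejected = torch.tensor([[0.5, -0.5]])

# The RewardTrainer-style loss from above
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()

# The element-wise form written out by hand
c1, c2 = rewards_chosen[0]
r1, r2 = rewards_rejected[0]
manual = -0.5 * (nn.functional.logsigmoid(c1 - r1)
                 + nn.functional.logsigmoid(c2 - r2))

assert torch.isclose(loss, manual)  # both elements contribute symmetrically
```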
So, here’s the first question: should we be indifferent as to which element of the reward logits we hand to the PPO algorithm, or would it be even better to combine them, e.g. to sum them?
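By "combine" I mean something like the following (again my own sketch, with the same assumed (batch_size, 2) shape):

```python
import torch

raw_rewards = torch.randn(4, 2)  # hypothetical (batch_size, 2) reward-model logits

# Sum the two class logits into a single scalar reward per example
rewards_combined = raw_rewards.sum(dim=-1)

# ...or, equivalently up to a factor of 2, average them
rewards_combined = raw_rewards.mean(dim=-1)
```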
And here’s the second question: the PPOTrainer allows us to apply batch scaling and reward clipping to help stabilise gradient updates for the policy model, but why not just pass the reward logits through a logsigmoid() function first? That, after all, would mirror what the loss function in the RewardTrainer is doing.
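Concretely, what I’m imagining is something like this inside the PPO loop. Everything after the compute_reward_score call is a sketch under my own assumptions (the indexing into raw_rewards, the placeholder names query_tensors / response_tensors, and the list-of-scalars format handed to step):

```python
import torch

raw_rewards = ppo_trainer.model.compute_reward_score(input_ids, attention_masks)
scores = raw_rewards[:, 0]   # or [:, -1], or a sum -- see the first question

# Squash through log-sigmoid before handing the rewards to PPO,
# instead of (or as well as) the built-in scaling/clipping
squashed = torch.nn.functional.logsigmoid(scores)

rewards = [r for r in squashed]  # one scalar tensor per sample
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```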
Are there any theoretical reasons or implementation details that bear upon the above?