I am in the process of doing RLHF on Llama 2 13B. One of the steps is training a reward model.
I built a custom dataset of paired texts, one preferred and one less preferred. It is very similar to the "chosen"/"rejected" format in the official TRL reward-modeling example.
The reward model trained successfully (the eval accuracy in the logs was about 67%, but that's a story for a different day).
Now I would like to actually pass an input through the reward model and inspect its output.
However, I can't make any sense of what the reward model returns.
For example, I set up the input as follows:
chosen = "This is the chosen text."
rejected = "This is the rejected text."
test = {"chosen": chosen, "rejected": rejected}
Then I run:
import torch.nn as nn  # needed for the loss below

rewards_chosen = model(
    **tokenizer(chosen, return_tensors="pt")
).logits
print("reward chosen is", rewards_chosen)

rewards_rejected = model(
    **tokenizer(rejected, return_tensors="pt")
).logits
print("reward rejected is", rewards_rejected)

loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
print(loss)
Printing the loss wasn't helpful either; I don't see any consistent trend even if I swap rewards_chosen and rewards_rejected in the formula.
The raw outputs didn't yield much insight either. I don't understand how to interpret rewards_chosen and rewards_rejected: in some examples rewards_chosen is larger, and in others it is smaller (shouldn't it always be higher for the chosen text?).
I also tried rewards_chosen > rewards_rejected, but that isn't helpful either, since it outputs tensor([[ True, False]]).
How do I figure out what the output of the reward model means, and how do I tell which string it is preferring?
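For what it's worth, here is a minimal self-contained sketch of what I expected the comparison to look like, assuming the reward head returns a single scalar per sequence. The tensors below are stand-ins for real model outputs, not values from my model:

```python
import torch
import torch.nn as nn

# Stand-in scalar rewards with shape (batch, 1), the shape a
# single-logit sequence-classification head would produce.
rewards_chosen = torch.tensor([[1.2]])
rewards_rejected = torch.tensor([[-0.3]])

# Pairwise preference loss used in reward-model training:
# it is small when the chosen reward exceeds the rejected one.
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()

# Under this assumption, the preferred string is simply the one
# with the larger scalar reward.
prefers_chosen = (rewards_chosen > rewards_rejected).item()
print(loss.item())      # small positive value when chosen > rejected
print(prefers_chosen)
```

My confusion is that my model's .logits don't look like a single scalar per input, so this simple comparison doesn't obviously apply.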