I am in the process of doing RLHF on Llama 2 13B. One of the steps is training a reward model.
I built a custom dataset of paired texts, one preferred and one less preferred. It is very similar to the "chosen"/"rejected" format in the official TRL reward-modeling example.
The reward model trained successfully (the eval accuracy in the logs was about 67%, but that's a story for a different day).
Now I would like to actually pass an input through the reward model and inspect its output.
However, I can't make any sense of what the reward model returns.
For example, I set up the input as follows:
chosen = "This is the chosen text."
rejected = "This is the rejected text."
test = {"chosen": chosen, "rejected": rejected}
Then I run:
import torch.nn as nn  # needed for the loss below

rewards_chosen = model(
    **tokenizer(chosen, return_tensors="pt")
).logits
print("reward chosen is", rewards_chosen)

rewards_rejected = model(
    **tokenizer(rejected, return_tensors="pt")
).logits
print("reward rejected is", rewards_rejected)

loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
print(loss)
Printing the loss wasn't helpful either; I don't see any consistent trend even if I swap rewards_chosen and rewards_rejected in the formula.
The raw outputs didn't yield much insight either. I don't understand how to interpret rewards_chosen and rewards_rejected: in some examples rewards_chosen is larger, and in others it is smaller (shouldn't it always be higher for the chosen text?).
I also tried rewards_chosen > rewards_rejected, but that isn't helpful either, since it outputs tensor([[ True, False]]).
How do I figure out what the output of the reward model means, and how do I tell which string it is preferring?
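For what it's worth, here is a minimal self-contained sketch of what I expected the comparison to look like, assuming the reward head returns a single scalar per sequence. The tensors below are stand-ins for real model outputs, not values from my model:

```python
import torch
import torch.nn as nn

# Stand-in scalar rewards with shape (batch, 1), the shape a
# single-logit sequence-classification head would produce.
rewards_chosen = torch.tensor([[1.2]])
rewards_rejected = torch.tensor([[-0.3]])

# Pairwise preference loss used in reward-model training:
# it is small when the chosen reward exceeds the rejected one.
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()

# Under this assumption, the preferred string is simply the one
# with the larger scalar reward.
prefers_chosen = (rewards_chosen > rewards_rejected).item()
print(loss.item())      # small positive value when chosen > rejected
print(prefers_chosen)
```

My confusion is that my model's .logits don't look like a single scalar per input, so this simple comparison doesn't obviously apply.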