What's the correct way to do thumbs up/down style training?

This might be a stupid question, but I have a project with a model that learns while it’s running, based on a positive/negative rating given to each of its responses. So far I’ve been doing that with TRL’s PPOTrainer class, using something like this:

def train(self, prompt, response, reward):
    query = self.encodeTensor(prompt, bos=True)
    answer = self.encodeTensor(response, eos=True)
    return self.ppo_trainer.step(query, answer, [torch.tensor(reward)])

This is part of a class I use to manage anything directly model-related. ppo_trainer is an instance of the PPOTrainer class, and encodeTensor is a function that just converts text to a tensor with some formatting.
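To make the setup concrete, here is a minimal sketch of how a thumbs up/down rating could be turned into the scalar reward that gets passed to train. The +1.0/-1.0 reward values and the names Feedback, rating_to_reward, and make_feedback are hypothetical and chosen only for illustration; none of them come from TRL.

```python
from dataclasses import dataclass

# Assumed reward values for illustration; the real scale is a design choice.
THUMBS_UP_REWARD = 1.0
THUMBS_DOWN_REWARD = -1.0


@dataclass
class Feedback:
    """One rated interaction: the prompt, the model's response, and its reward."""
    prompt: str
    response: str
    reward: float


def rating_to_reward(thumbs_up: bool) -> float:
    """Map a binary rating to a scalar reward (hypothetical helper)."""
    return THUMBS_UP_REWARD if thumbs_up else THUMBS_DOWN_REWARD


def make_feedback(prompt: str, response: str, thumbs_up: bool) -> Feedback:
    """Bundle a rated prompt/response pair so it can be handed to train()."""
    return Feedback(prompt, response, rating_to_reward(thumbs_up))
```

A record built this way would then be passed along as train(fb.prompt, fb.response, fb.reward), where train is the method shown earlier.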
As far as I can tell, this code works fine, although I haven’t had the chance to test it much. However, I’ve started getting warnings that PPOTrainer is deprecated and being replaced by PPOv2Trainer, which doesn’t seem to have a step function or any equivalent that would let me train the model directly on a prompt/response pair and a rating.
I’m also fairly certain I’m using this stuff incorrectly in the first place, so what is the actually correct way to do what I’m trying to do?
