What's the correct way to do thumbs up/down style training?

Void-05 · October 13, 2024, 2:24pm

This might be a stupid question, but I have a project with a model that learns while it’s running, based on a positive/negative rating given to each of its responses. So far I’ve been doing that with TRL’s PPOTrainer class, with something like this:

def train(self, prompt, response, reward):
    query = self.encodeTensor(prompt, bos=True)    # prompt tokens, BOS prepended
    reply = self.encodeTensor(response, eos=True)  # response tokens, EOS appended
    return self.ppo_trainer.step(query, reply, [torch.tensor(reward)])

This is part of a class I use to manage anything directly model-related. ppo_trainer is an instance of the PPOTrainer class, and encodeTensor is a function that just converts text to a tensor with some formatting.
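For context, a helper like encodeTensor might look roughly like the sketch below. It is a simplified illustration, not the actual code: encode_ids, BOS_ID, EOS_ID, and the whitespace vocabulary are all hypothetical, and a real version would presumably use the model’s tokenizer and wrap the resulting ids in torch.tensor.

```python
# Hypothetical stand-in for encodeTensor; the ids and vocabulary are made up.
BOS_ID = 1  # assumed beginning-of-sequence id
EOS_ID = 2  # assumed end-of-sequence id


def encode_ids(text, vocab, bos=False, eos=False):
    """Map whitespace-separated words to ids, optionally adding BOS/EOS."""
    ids = []
    if bos:
        ids.append(BOS_ID)
    for word in text.split():
        # Assign the next free id to words not seen before.
        if word not in vocab:
            vocab[word] = len(vocab) + 3  # ids 0-2 are reserved
        ids.append(vocab[word])
    if eos:
        ids.append(EOS_ID)
    return ids


vocab = {}
prompt_ids = encode_ids("how are you", vocab, bos=True)    # BOS on the prompt
response_ids = encode_ids("fine thanks", vocab, eos=True)  # EOS on the response
print(prompt_ids)
print(response_ids)
```

The prompt gets only a BOS marker and the response only an EOS marker, so the two together form one complete sequence, which matches the bos=True / eos=True calls in train.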
This code works fine as far as I can tell (I haven’t had the chance to test it very much), but I’ve started getting warnings that PPOTrainer is deprecated and is being replaced with PPOv2Trainer. As far as I can tell, PPOv2Trainer doesn’t have a step function or an equivalent that would let me train a model directly on a prompt/response pair and a rating.
I’m also fairly certain I’m using this stuff incorrectly in the first place, so what is the actually correct way to do what I’m trying to do?
