Hello!
I am interested in using the TRL implementation of PPO. Normally, the reference comparison is between p_new(x) and p_old(x), where both are LLMs. I was wondering whether there is a straightforward way, using the PPO class in TRL, of comparing p_new(x) to p_old(x | c), where c is some conditioning prompt.
Thank you so much!
Update: I don’t think there is a way of doing this. However, it is fairly straightforward to just fork the repo and modify the source code yourself to implement this. This is what I ended up going for.
That’s a great question! Comparing p_new(x) to p_old(x | c), where c is a conditioning prompt, is an interesting use case for PPO in TRL.
You’re right that the default implementation of TRL’s PPO is designed for p_new(x) vs. p_old(x), so conditioning on c isn’t directly supported out of the box. However, your approach of forking the repo and modifying the source code is a solid solution! By introducing the conditioning logic directly into the PPO implementation, you should be able to adapt it to your use case.
If others are curious, one way to approach this might be to modify the way the reference model generates probabilities for the “old” distribution, ensuring that c is passed as input during evaluation (a minimal sketch follows after this list). Specifically:
- Update the `log_probs` computation to include c as part of the context.
- Ensure that the `ref_model` in TRL is called with the same prompt c that you use for the policy model.
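For anyone prototyping this, here is a minimal sketch (using plain `transformers`, not TRL’s internals) of how one might score a response under the reference model with the conditioning prompt c prepended, i.e. compute log p_old(x | c). The checkpoint name `"gpt2"` and the helper `response_logprob` are placeholders for illustration; in a fork, the equivalent logic would go wherever the reference log-probs are computed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute whatever reference model you actually use.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model.eval()


def response_logprob(model, prompt: str, response: str) -> torch.Tensor:
    """Sum of token log-probs of `response` given `prompt`, i.e. log p(x | c)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)

    with torch.no_grad():
        logits = model(input_ids).logits

    # Logits at position t predict the token at position t + 1.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the positions corresponding to the response tokens.
    response_len = response_ids.shape[-1]
    return token_log_probs[:, -response_len:].sum()


# log p_old(x | c): score the response with the conditioning prompt c prepended.
c = "Answer politely: "
x = "Sure, happy to help!"
ref_logprob = response_logprob(ref_model, c, x)
```

The KL/ratio term would then compare this against the policy model’s log-prob of x computed without c prepended, which is the asymmetry described in the original question.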
It’s awesome that you’ve already taken the initiative to make these changes. If you’re open to it, consider sharing your fork or a guide on how you implemented this—it could help others looking to do something similar.
Good luck with your work, and let us know how it turns out! 
What you are suggesting is exactly what I ended up doing. Thank you for the answer! I am still prototyping a bit, so I don’t have stable code right now, but if I end up with a working version I’ll make sure to share it.