Hello!
I am interested in using the TRL implementation of PPO. Normally, the reference comparison is between p_new(x) and p_old(x), where both are LLMs. I was wondering whether there is a straightforward way, using the PPO class in TRL, of comparing p_new(x) to p_old(x | c), where c is some conditioning prompt.
Thank you so much!
Update: I don’t think there is a way of doing this. However, it is fairly straightforward to just fork the repo and modify the source code yourself to implement this. This is what I ended up going for.
That’s a great question! Comparing p_new(x) to p_old(x | c), where c is a conditioning prompt, is an interesting use case for PPO in TRL.
You’re right that the default implementation of TRL’s PPO is designed for p_new(x) vs. p_old(x), so conditioning on c isn’t directly supported out of the box. However, your approach of forking the repo and modifying the source code is a solid solution! By introducing the conditioning logic directly into the PPO implementation, you should be able to adapt it to your use case.
If others are curious, one way to approach this might be to modify the way the reference model generates probabilities for the “old” distribution, ensuring that c is passed as input during evaluation (a minimal sketch follows after this list). Specifically:
- Update the `log_probs` computation to include c as part of the context.
- Ensure that the `ref_model` in TRL is called with the same prompt c that you use for the policy model.
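For anyone prototyping this, here is a minimal sketch (using plain `transformers`, not TRL’s internals) of how one might score a response under the reference model with the conditioning prompt c prepended, i.e. compute log p_old(x | c). The checkpoint name `"gpt2"` and the helper `response_logprob` are placeholders for illustration; in a fork, the equivalent logic would go wherever the reference log-probs are computed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute whatever reference model you actually use.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model.eval()


def response_logprob(model, prompt: str, response: str) -> torch.Tensor:
    """Sum of token log-probs of `response` given `prompt`, i.e. log p(x | c)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)

    with torch.no_grad():
        logits = model(input_ids).logits

    # Logits at position t predict the token at position t + 1.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the positions corresponding to the response tokens.
    response_len = response_ids.shape[-1]
    return token_log_probs[:, -response_len:].sum()


# log p_old(x | c): score the response with the conditioning prompt c prepended.
c = "Answer politely: "
x = "Sure, happy to help!"
ref_logprob = response_logprob(ref_model, c, x)
```

The KL/ratio term would then compare this against the policy model’s log-prob of x computed without c prepended, which is the asymmetry described in the original question.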
It’s awesome that you’ve already taken the initiative to make these changes. If you’re open to it, consider sharing your fork or a guide on how you implemented this—it could help others looking to do something similar.
Good luck with your work, and let us know how it turns out! 
What you are suggesting is exactly what I ended up doing. Thank you for the answer! I am still prototyping a bit, so I don’t have stable code right now, but if I end up with a working version I’ll make sure to share it.