So in RLHF we have 3 steps.
1 - SFT: Take a base model (let's say it is gemma-2b) and fine-tune it on your dataset.
2 - Reward model training:
Which base model do we use here for training? Is it the base gemma-2b, the SFT-trained model, or some other model?
3 - PPO training:
Again, what is our base model at this stage? As far as I understand, we use the trained reward model here to get a y_hat score that goes into the loss during PPO training.
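To make the question concrete, here is a rough sketch of how I picture the three stages in code. The wiring is just my assumption (the Hugging Face `transformers` classes and the hypothetical checkpoint name "gemma-2b-sft" are mine, not from any particular recipe), and the parts I'm unsure about are marked in the comments:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

base_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Step 1 - SFT: start from the base model and fine-tune it on my dataset.
sft_model = AutoModelForCausalLM.from_pretrained(base_id)
# ... supervised fine-tuning loop here, checkpoint saved as e.g. "gemma-2b-sft"

# Step 2 - Reward model: a model with a single scalar head, trained on
# preference pairs (chosen vs. rejected). My question: is it initialized from
# the base gemma-2b, from the SFT checkpoint, or from some other model?
reward_model = AutoModelForSequenceClassification.from_pretrained(
    base_id,       # <-- or "gemma-2b-sft"? This is exactly what I'm asking.
    num_labels=1,  # one scalar reward score per sequence
)
# ... reward-model training loop on the preference data

# Step 3 - PPO: the policy generates responses, the reward model scores each
# one (the y_hat I mentioned), and that score feeds into the PPO loss.
# Same question here: is the policy initialized from the SFT checkpoint, the
# base model, or something else?
policy = AutoModelForCausalLM.from_pretrained(base_id)  # <-- or "gemma-2b-sft"?
# ... PPO loop: sample responses, score them with reward_model, update policy
```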