So in RLHF we have 3 steps.
1 - SFT: Take a base model (let's say it is gemma-2b) and fine-tune it on your dataset.
2 - Reward model training:
Which base model do we use here for training? Is it the base gemma-2b, the SFT-trained model, or some other model?
3 - PPO training:
Again, what is our base model at this stage? As far as I understand, we use the trained reward model here to get a y_hat score that goes into the loss during PPO training.
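To make the question concrete, here is a rough sketch of how I picture the three stages in code. The wiring is just my assumption (the Hugging Face `transformers` classes and the hypothetical checkpoint name "gemma-2b-sft" are mine, not from any particular recipe), and the parts I'm unsure about are marked in the comments:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

base_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Step 1 - SFT: start from the base model and fine-tune it on my dataset.
sft_model = AutoModelForCausalLM.from_pretrained(base_id)
# ... supervised fine-tuning loop here, checkpoint saved as e.g. "gemma-2b-sft"

# Step 2 - Reward model: a model with a single scalar head, trained on
# preference pairs (chosen vs. rejected). My question: is it initialized from
# the base gemma-2b, from the SFT checkpoint, or from some other model?
reward_model = AutoModelForSequenceClassification.from_pretrained(
    base_id,       # <-- or "gemma-2b-sft"? This is exactly what I'm asking.
    num_labels=1,  # one scalar reward score per sequence
)
# ... reward-model training loop on the preference data

# Step 3 - PPO: the policy generates responses, the reward model scores each
# one (the y_hat I mentioned), and that score feeds into the PPO loss.
# Same question here: is the policy initialized from the SFT checkpoint, the
# base model, or something else?
policy = AutoModelForCausalLM.from_pretrained(base_id)  # <-- or "gemma-2b-sft"?
# ... PPO loop: sample responses, score them with reward_model, update policy
```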