Hi! From what I understand, the old model in GRPO gets updated once per epoch. In trl/trl/trainer/grpo_trainer.py at main (huggingface/trl on GitHub), I see that the current model's probabilities are used for the old model too. That would work if the old model were updated per batch, but not per epoch, correct?
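To make the question concrete, here is a minimal sketch of what I mean. It is not TRL's code: `clipped_surrogate` and the probabilities are made up for illustration, and it only looks at a single token. It compares the clipped objective when the old log-prob is just the current one against the case where the old log-prob comes from an earlier snapshot of the policy.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Clipped PPO/GRPO-style objective for one token (hypothetical helper)."""
    # Importance ratio pi_new(token) / pi_old(token), computed from log-probs.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Take the more pessimistic of the unclipped and clipped terms.
    objective = min(ratio * advantage, clipped_ratio * advantage)
    return ratio, objective


# Case 1: the "old" log-prob is the current policy's log-prob on the same batch
# (what I think the trainer does). The ratio is 1 and clipping never triggers.
logp_current = math.log(0.30)
ratio_same, obj_same = clipped_surrogate(logp_current, logp_current, advantage=1.0)

# Case 2: the "old" log-prob is a stale snapshot (assumed value 0.30) and the
# policy has since moved to 0.45, e.g. after several updates within an epoch.
logp_stale = math.log(0.30)
logp_moved = math.log(0.45)
ratio_stale, obj_stale = clipped_surrogate(logp_moved, logp_stale, advantage=1.0)

print(ratio_same, obj_same)
print(ratio_stale, obj_stale)
```

In the first case the ratio is always 1, so the clipping has no effect. As far as I can tell, that only matches the algorithm if the old model is refreshed before every batch, which is the situation I am asking about.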
Thank you.