Hello @lvwerra, @natolambert, I am trying to improve a Pegasus model in certain aspects using the TRL library. My reward function is based on ROUGE. While training on a subset of the CNN dataset, the loss seems to explode and the model starts outputting gibberish. Since I am new to this area, I would appreciate some help understanding the problem. You can view the Wandb logs here.
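
For context, here is a minimal sketch of what a ROUGE-style reward can look like. It is a simplified ROUGE-1 F1 over lowercase whitespace tokens, written in plain Python so it runs on its own. The function name `rouge1_f1_reward` is hypothetical, and this is an illustration only, not the exact reward used in the training run.

```python
from collections import Counter


def rouge1_f1_reward(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1 (illustrative assumption, not the actual reward code)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0

    # Clipped unigram overlap: a token counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A reward like this is bounded in [0, 1] and returns 0 for empty or fully non-overlapping summaries, so once the model starts producing gibberish the reward signal tends to flatten out near zero.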