I am trying to fine-tune Llama with the `PPOTrainer` class of TRL, following a tutorial similar to the one used to fine-tune GPT-2 on the IMDB dataset.
But I keep getting this error when logging to wandb: `ValueError: autodetected range of [nan, nan] is not finite`

Also, many PPO-related values such as `ppo/loss/policy`, `ppo/loss/value`, `ppo/loss/total`, `ppo/policy/entropy`, etc. are NaN.
Refer to this notebook (a copy of the tutorial notebook, but with a different model) to reproduce the error.

Hi Harshvir! I am encountering exactly the same situation while testing with a small GPT-NeoX model. Did you already solve this problem? I would appreciate it very much if you could share the solution! Thanks!