Finetune Llama with PPOTrainer

I am trying to finetune Llama with the PPOTrainer class from TRL, following the tutorial that uses a similar setup to finetune GPT-2 on the IMDB dataset.
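For context, here is roughly my setup, adapted from the tutorial. The Llama checkpoint name and the dataset slice below are placeholders for what I actually use:

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Placeholder: the actual Llama checkpoint I use differs
model_name = "meta-llama/Llama-2-7b-hf"

config = PPOConfig(
    model_name=model_name,
    learning_rate=1.41e-5,
    log_with="wandb",  # the ValueError below is raised from this wandb logging
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Same IMDB preprocessing as in the tutorial, just shortened here
dataset = load_dataset("imdb", split="train[:1%]").rename_column("text", "query")

def tokenize(sample):
    sample["input_ids"] = tokenizer(sample["query"], truncation=True, max_length=8)["input_ids"]
    return sample

dataset = dataset.map(tokenize)
dataset.set_format(type="torch")

# Policy and frozen reference model, both with a value head for PPO
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

def collator(data):
    # Keep variable-length queries as lists of tensors, as in the tutorial
    return {key: [d[key] for d in data] for key in data[0]}

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer,
                         dataset=dataset, data_collator=collator)
```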
But I keep getting this error when logging to wandb: `ValueError: autodetected range of [nan, nan] is not finite`

Many PPO-related stats, such as `ppo/loss/policy`, `ppo/loss/value`, `ppo/loss/total`, `ppo/policy/entropy`, etc., are also NaN.
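The training loop is essentially the tutorial's, condensed below; the reward is a placeholder constant where I actually call a sentiment pipeline. The NaN values show up in the `stats` dict returned by `ppo_trainer.step`:

```python
import torch
from tqdm import tqdm

generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "max_new_tokens": 16,
}

for batch in tqdm(ppo_trainer.dataloader):
    query_tensors = batch["input_ids"]

    # Generate one response per query with the current policy
    response_tensors = []
    for query in query_tensors:
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[len(query):])
    batch["response"] = [tokenizer.decode(r) for r in response_tensors]

    # Placeholder reward; in my notebook this comes from a sentiment pipeline
    rewards = [torch.tensor(1.0) for _ in response_tensors]

    # stats contains ppo/loss/policy, ppo/loss/value, ppo/policy/entropy, ...
    # and these all come back as NaN
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)  # ValueError is raised here
```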
See this notebook (a copy of the tutorial notebook, but with a different model) for the full error.