RL for LLM keeps outputting NaN

Hello,
I need some help with my RL project. After one training iteration, the model keeps outputting NaN tensors (the logits are all NaN, see below). I think something is wrong with the PPO updater, but I don't know how to debug it properly. I used this as a reference: lamorel/examples/PPO_finetuning/main.py at main · flowersteam/lamorel · GitHub.
I can provide code snippets as well if needed.
Thanks in advance

Loss after episode 7: 1199.0150146484375
Encoded inputs: {'input_ids': tensor([[    1, 17158,   356,   456, 15379, 28747,   464,  2501, 15951,   352,
         28742,  8270,  3078,   864,   272,  1679,  2992,   354, 15094,   374,
           288,  7076,   272, 28705, 28740, 28734, 28723, 28740, 28734, 28723,
         28774, 28787, 28723, 28740, 28782, 28774, 28804,  7133,   547,   865,
           272,  2992,  3837,  2490,   272,  7076, 28725,  1671,   707,  4870,
          6128, 28725, 13268, 28723, 11159,  3187,   282,   349,   298,  1300,
           304,  1565,   272,  6673, 28723,  7088,  1729,   486, 12573,   279,
           264, 12985,  2437,   356,   272,  2718,  1587, 28723]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1]], device='cuda:0')}
CausalLMOutputWithPast(loss=None, logits=tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='cuda:0',
...
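
For context, here is a minimal sketch (assuming plain PyTorch; the function names `ppo_loss` and `safe_update` are my own, not from lamorel) of the kind of guards I understand are usually put around a PPO update to stop NaNs from appearing: clamping the log-probability ratio before `exp()` so it can't overflow, skipping the step when the loss is already non-finite, and clipping gradients.

```python
import torch

def ppo_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Clamp the log-ratio before exponentiating: exp() of a large
    # difference overflows to inf, which then propagates NaN everywhere.
    log_ratio = torch.clamp(logprobs - old_logprobs, -20.0, 20.0)
    ratio = torch.exp(log_ratio)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def safe_update(optimizer, loss, model, max_grad_norm=0.5):
    # Skip the optimizer step entirely if the loss is already non-finite;
    # otherwise clip gradients so one bad batch can't blow up the weights.
    if not torch.isfinite(loss):
        optimizer.zero_grad()
        return False
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return True
```

Is this roughly the right direction, or is there a better way to pin down where the first NaN is produced (e.g. `torch.autograd.set_detect_anomaly(True)`)?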