Hello,
I need help with my RL project. After one iteration, the model keeps outputting NaN in the tensor. I think something is wrong with the PPO updater, but I don't know how to debug it properly. I used lamorel/examples/PPO_finetuning/main.py (flowersteam/lamorel on GitHub) as a reference.
I can provide code snippets as well if needed.
Thanks in advance
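For context on what I mean by the PPO updater: below is a minimal, stdlib-only sketch of the clipped PPO objective with the kind of log-ratio guard I suspect is missing somewhere in my update step (function name and clamp bounds are placeholders, not my actual code). An unclamped `exp()` of a large log-probability ratio overflows to inf, and a later `inf - inf` or `inf * 0` would produce NaNs like the ones in the logits dump further down.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    # Work in log space and clamp the log-ratio BEFORE exponentiating;
    # exp() of a huge unclamped log-ratio overflows to inf, and the
    # following arithmetic then turns into NaN.
    log_ratio = max(min(logp_new - logp_old, 20.0), -20.0)
    ratio = math.exp(log_ratio)

    # Standard PPO clipped surrogate: take the pessimistic (smaller)
    # of the unclipped and clipped terms.
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return min(unclipped, clipped)
```

With the clamp in place, even a pathological log-prob gap stays finite instead of propagating NaN into the network weights on the next optimizer step.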
Loss after episode 7: 1199.0150146484375
(The loss blows up right before the NaNs appear; the encoded input and the model output from the next step are below.)
Encoded inputs: {'input_ids': tensor([[ 1, 17158, 356, 456, 15379, 28747, 464, 2501, 15951, 352,
28742, 8270, 3078, 864, 272, 1679, 2992, 354, 15094, 374,
288, 7076, 272, 28705, 28740, 28734, 28723, 28740, 28734, 28723,
28774, 28787, 28723, 28740, 28782, 28774, 28804, 7133, 547, 865,
272, 2992, 3837, 2490, 272, 7076, 28725, 1671, 707, 4870,
6128, 28725, 13268, 28723, 11159, 3187, 282, 349, 298, 1300,
304, 1565, 272, 6673, 28723, 7088, 1729, 486, 12573, 279,
264, 12985, 2437, 356, 272, 2718, 1587, 28723]],
device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1]], device='cuda:0')}
CausalLMOutputWithPast(loss=None, logits=tensor([[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]]], device='cuda:0',
...