GPT model loss diverges while fine-tuning

I’m trying to fine-tune the distilled GPT model on a new dataset, and the loss diverges during training. Can anyone think of why this might be happening? I’ve lowered the learning rate to 1e-7, which feels extremely low, but the issue persists.