I’m trying to fine-tune the distilled GPT model on a new dataset and I’m having issues with the loss diverging during training. Can anyone think of why this might be happening? I’ve lowered the learning rate to 1e-7, which feels extremely low, and the issue still persists.
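For context, one failure mode I’m trying to rule out is numerical overflow rather than step size: if logits or activations blow up (a bad sample, fp16, etc.), the loss goes to inf/NaN and no learning rate will fix it. Here’s a toy illustration of the idea, not code from my actual training run (`naive_softmax` and `stable_softmax` are made-up helpers):

```python
import math

def naive_softmax(logits):
    # math.exp overflows for large inputs, so this blows up
    # no matter how small the learning rate is
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(logits):
    # subtracting the max keeps every exponent <= 0, avoiding overflow
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1000.0, 0.0]  # a logit this large can appear when activations explode
try:
    naive_softmax(logits)
except OverflowError:
    print("naive softmax overflowed")
print(stable_softmax(logits))  # well-behaved: approximately [1.0, 0.0]
```

Does something like this sound plausible, or should I be looking at the data pipeline instead?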