Abnormal learning rate curve

I am working on a school project to classify news headlines; it's a binary classification task. I scraped the news headlines and split them with sklearn's train_test_split, then used ktrain with DistilBERT to classify them. ktrain has a learning rate finder function; when I run it, I get an abnormal learning rate curve, as shown in the image below:


while a normal learning rate curve should be roughly U-shaped: the loss falls gradually from a higher value and then rises again.

What does that abnormal learning rate curve imply? Does it have anything to do with overfitting? I am really new to transformers and there are not many resources on the internet, so I'm asking here. Thanks.
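For reference, my setup roughly follows this pattern (a minimal sketch of what I'm describing; the model name, maxlen, class names, and batch size here are placeholders, not the exact values from my notebook):

```python
from sklearn.model_selection import train_test_split
import ktrain
from ktrain import text

# split the scraped headlines (binary labels: 0 / 1)
x_train, x_test, y_train, y_test = train_test_split(
    headlines, labels, test_size=0.2, random_state=42
)

# wrap the pretrained DistilBERT from Hugging Face via ktrain
t = text.Transformer('distilbert-base-uncased', maxlen=64,
                     class_names=['class_0', 'class_1'])
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)

model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=16)

# the learning rate finder that produces the curve in the image above
learner.lr_find(show_plot=True, max_epochs=2)
```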

Hi, I'm not sure if I understand correctly:

  • Which “learning rate finder function” do you use? (Just curious, as I'm familiar with the idea from the fast.ai team.)
  • The spike at 10^-1 looks plausible to me, since that is a very large learning rate.
  • Did you initialize your model from a checkpoint? If yes, maybe the model is already good and has a small loss at the beginning, so a small loss at LR = 10^-7 is also plausible.

I just followed this step-by-step tutorial of ktrain; it is a lightweight wrapper around Hugging Face Transformers, and it has the learning rate finder function. I am not sure about the initializing-from-a-checkpoint part: does that mean a pretrained model? As far as I know, the tutorial uses the pretrained DistilBERT from Hugging Face.

Also, the learning rate curve demonstrated in the tutorial looks like this:


That’s why I think mine is not a normal one.

Yes, the pretrained model (= the checkpoint I mentioned above) can be one reason. Maybe you can try removing the pretrained weights (initialize from scratch) and plot the graph again.
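If you want to try that comparison, one way is to go through the underlying transformers library directly rather than ktrain's wrapper (just a sketch; the model class and checkpoint name are the usual Hugging Face ones, not something specific to the tutorial):

```python
from transformers import DistilBertConfig, TFDistilBertForSequenceClassification

# pretrained weights, i.e. initialized from the DistilBERT checkpoint
pretrained = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=2
)

# same architecture, but randomly initialized (no pretrained checkpoint)
config = DistilBertConfig(num_labels=2)
from_scratch = TFDistilBertForSequenceClassification(config)
```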

Note that the learning rate curve you posted does not seem to be the actual lr used during training, but the output of a utility function that tests out different lrs to see how they influence the loss. Its goal is to help you find a good starting lr. In other words, the graph is not the learning rate changing over time, but the loss over a range of different lr values. In that sense, it is very normal that the curve differs between checkpoints, architectures, and even datasets.
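In pseudocode, the idea behind such a finder (this is the general fast.ai-style LR range test, not ktrain's exact implementation; the function and argument names are my own) is roughly:

```python
import numpy as np
import tensorflow as tf

def lr_range_test(model, batches, start_lr=1e-7, end_lr=10.0):
    """Sketch of an LR range test for a compiled tf.keras model.

    `batches` is an iterable of (x, y) tuples. The learning rate is
    increased exponentially from start_lr to end_lr, one step per batch,
    and the training loss is recorded after each step. Plotting `losses`
    against `lrs` gives a finder-style curve (loss vs. lr) -- it is not
    the lr schedule of the real training run."""
    batches = list(batches)
    lrs = np.geomspace(start_lr, end_lr, num=len(batches))
    losses = []
    for lr, (x, y) in zip(lrs, batches):
        tf.keras.backend.set_value(model.optimizer.learning_rate, lr)
        losses.append(model.train_on_batch(x, y))
    return lrs, losses
```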

Cool to know all this, thanks both!