Debugging a model that isn't learning

This is a more open-ended question. Suppose you've trained a model and, looking at the loss curve, you see it hasn't learned (the curve is flat). What sort of things would you begin to investigate to understand the cause? Additionally, what sort of things might you log a priori to help you debug a model that didn't learn (in the event that it happened)?

First things first: don't wait until training has finished. Use a tool like TensorBoard or Weights & Biases, or simply print your loss after every step, so you can spot a flat curve and jump in quickly.
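
For example, a bare-bones per-step logging loop in PyTorch might look like this. It's a minimal sketch with a toy linear model and random data; swap in your own model, loss, and dataloader:

```python
# Minimal sketch: log the training loss at every step with TensorBoard.
import torch
from torch import nn
from torch.utils.tensorboard import SummaryWriter

model = nn.Linear(10, 1)                      # toy model, stands in for yours
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
writer = SummaryWriter(log_dir="runs/debug")  # view live with `tensorboard --logdir runs`

x, y = torch.randn(32, 10), torch.randn(32, 1)  # random stand-in data
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    writer.add_scalar("train/loss", loss.item(), step)
    print(f"step {step}: loss {loss.item():.4f}")
```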

The first thing your model should be capable of is overfitting a tiny dataset (e.g. 1-5 samples). Trying this out is relatively simple, and if the model can't do it, you know something is wrong.
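
Here is roughly what that sanity check looks like in PyTorch. A minimal sketch; the small MLP, the four random samples, and their labels are just placeholders for your own setup:

```python
# Minimal sketch of the overfit-a-tiny-batch check:
# train on the same handful of samples until the loss collapses toward zero.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 10)          # 4 fixed samples
y = torch.tensor([0, 1, 0, 1])  # fixed labels

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

# If the model and training loop are wired correctly, this should be close to 0.
print(f"final loss on the tiny batch: {loss.item():.6f}")
```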

Good things to check: is the learning rate too high or too low? Did you accidentally freeze the model, leaving it with no trainable parameters? Are the shapes in your forward pass correct? Did you remember to zero the gradients after each optimizer step? And so on.
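
A quick way to rule out the frozen-model and broken-gradient cases is to count trainable parameters and look at the gradient norms after a single backward pass. A minimal sketch; the toy model and data are placeholders:

```python
# Minimal sketch: check for trainable parameters and for gradients reaching them.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"trainable parameters: {sum(p.numel() for p in trainable)}")  # 0 means the model is frozen

x, y = torch.randn(4, 10), torch.tensor([0, 1, 0, 1])
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()

for name, p in model.named_parameters():
    grad_norm = p.grad.norm().item() if p.grad is not None else float("nan")
    print(f"{name}: grad norm = {grad_norm:.2e}")  # all-zero or NaN norms point at a broken backward path
```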


Extending the learning-rate point: check the learning rate schedule. Print out the lr at each step and compute by hand the value it is supposed to be. A wrong training-step count feeding into the lr schedule once took me a whole month to debug…
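
One way to do that comparison is to step the scheduler in isolation and print the actual lr next to the hand-computed value. A minimal sketch using PyTorch's LinearLR; the base lr, total step count, and the linear-decay formula are just example choices:

```python
# Minimal sketch: compare the scheduler's lr with the value computed by hand.
import torch
from torch import nn

model = nn.Linear(10, 1)
base_lr, total_steps = 1e-3, 1000
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=total_steps
)

for step in range(total_steps):
    optimizer.step()   # dummy step; in a real loop this follows backward()
    scheduler.step()
    if step % 200 == 0:
        actual = optimizer.param_groups[0]["lr"]
        expected = base_lr * (1 - (step + 1) / total_steps)  # hand-computed linear decay
        print(f"step {step}: lr = {actual:.6f}, expected = {expected:.6f}")
```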
