Debugging a model that isn't learning

This is a more open-ended question. Suppose you've trained a model and, looking at the loss curve, you see it hasn't learned (the curve is flat). What sort of things would you begin to investigate to understand the cause? Additionally, what sort of things might you log a priori to help you debug a model that didn't learn (in the event that it happened)?

First things first: don't wait until training has finished. Use a tool like TensorBoard or Weights & Biases, or simply print your loss after every step, so you can spot a flat curve and jump in quickly.
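
For example, a bare-bones per-step logging loop in PyTorch might look like this. It's a minimal sketch with a toy linear model and random data; swap in your own model, loss, and dataloader:

```python
# Minimal sketch: log the training loss at every step with TensorBoard.
import torch
from torch import nn
from torch.utils.tensorboard import SummaryWriter

model = nn.Linear(10, 1)                      # toy model, stands in for yours
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
writer = SummaryWriter(log_dir="runs/debug")  # view live with `tensorboard --logdir runs`

x, y = torch.randn(32, 10), torch.randn(32, 1)  # random stand-in data
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    writer.add_scalar("train/loss", loss.item(), step)
    print(f"step {step}: loss {loss.item():.4f}")
```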

The first thing your model should be capable of is overfitting a tiny dataset (e.g. 1-5 samples). Trying this out is relatively simple, and if the model can't do it, you know something is wrong.
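
Here is roughly what that sanity check looks like in PyTorch. A minimal sketch; the small MLP, the four random samples, and their labels are just placeholders for your own setup:

```python
# Minimal sketch of the overfit-a-tiny-batch check:
# train on the same handful of samples until the loss collapses toward zero.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 10)          # 4 fixed samples
y = torch.tensor([0, 1, 0, 1])  # fixed labels

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

# If the model and training loop are wired correctly, this should be close to 0.
print(f"final loss on the tiny batch: {loss.item():.6f}")
```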

Good things to check: is the learning rate too high or too low? Did you accidentally freeze the model, leaving it with no trainable parameters? Are the shapes in your forward pass correct? Did you remember to zero the gradients after each optimizer step? And so on.
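
A quick way to rule out the frozen-model and broken-gradient cases is to count trainable parameters and look at the gradient norms after a single backward pass. A minimal sketch; the toy model and data are placeholders:

```python
# Minimal sketch: check for trainable parameters and for gradients reaching them.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"trainable parameters: {sum(p.numel() for p in trainable)}")  # 0 means the model is frozen

x, y = torch.randn(4, 10), torch.tensor([0, 1, 0, 1])
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()

for name, p in model.named_parameters():
    grad_norm = p.grad.norm().item() if p.grad is not None else float("nan")
    print(f"{name}: grad norm = {grad_norm:.2e}")  # all-zero or NaN norms point at a broken backward path
```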


Extending the learning-rate point: check the learning rate schedule. Print out the lr at each step and compute by hand the value it is supposed to be. A wrong training-step count feeding into the lr schedule once took me a whole month to debug…
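
One way to do that comparison is to step the scheduler in isolation and print the actual lr next to the hand-computed value. A minimal sketch using PyTorch's LinearLR; the base lr, total step count, and the linear-decay formula are just example choices:

```python
# Minimal sketch: compare the scheduler's lr with the value computed by hand.
import torch
from torch import nn

model = nn.Linear(10, 1)
base_lr, total_steps = 1e-3, 1000
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=total_steps
)

for step in range(total_steps):
    optimizer.step()   # dummy step; in a real loop this follows backward()
    scheduler.step()
    if step % 200 == 0:
        actual = optimizer.param_groups[0]["lr"]
        expected = base_lr * (1 - (step + 1) / total_steps)  # hand-computed linear decay
        print(f"step {step}: lr = {actual:.6f}, expected = {expected:.6f}")
```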
