Positive loss value changes to negative loss while training Informer or TimeSeriesTransformer model

I’m cutting my teeth with Hugging Face by trying to use the InformerForPrediction model, and I’m seeing what seems to be odd behavior. Specifically, the loss reported during training starts out positive but winds up going negative after about 40 epochs. This is the first time I’ve encountered the negative log-likelihood loss function, so I’m not really sure what the expected behavior is. I did some searching, and when other people reported negative loss values it was because they were not passing log() values to the loss function, and their negative loss values were described as “nonsensical.” The posts I found had two things different from what is going on with InformerForPrediction:

  1. They were using Torch NLLLoss
  2. Their loss values started out negative

I did a little digging, and it seems that InformerForPrediction is not using Torch’s NLLLoss, but as far as I could tell it is using log probabilities like it should be. The posts I found fixed their problem by changing their code to use log probabilities, but of course that is outside the scope of the code I’ve created, because the loss function and calculation are handled entirely by the Transformers library. I’m brand new to this domain, so my usage of the library is suspect, but I can’t for the life of me figure out what I might be doing wrong that would cause the loss values to become nonsensical as training progresses.
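For context, here is a minimal sketch of the failure mode those posts describe (this is an illustration only, not related to Informer’s actual loss code): PyTorch’s nll_loss expects log-probabilities, and feeding it raw probabilities instead yields meaningless negative values.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)            # batch of 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 0])  # ground-truth class indices

# Correct usage: nll_loss expects log-probabilities, so the loss is
# -log p(target), which is always >= 0.
log_probs = F.log_softmax(logits, dim=1)
good = F.nll_loss(log_probs, targets)

# Incorrect usage: passing raw probabilities means the "loss" is just
# -p(target), which lands in (-1, 0) — the nonsensical negative values
# those posts were seeing.
probs = F.softmax(logits, dim=1)
bad = F.nll_loss(probs, targets)
```

That failure mode starts negative from the very first step, which matches point 2 above and differs from what Informer is doing here.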

So my question is: is it supposed to do that? Here’s some output of loss.item() during the training run:

Epoch 0 Batch 1: 8.379134178161621
Epoch 0 Batch 2: 8.230722427368164
Epoch 0 Batch 3: 7.8155083656311035
Epoch 1 Batch 1: 7.351840019226074
Epoch 1 Batch 2: 7.242279052734375
Epoch 27 Batch 1: 1.4881057739257812
Epoch 27 Batch 2: 1.4162497520446777
Epoch 27 Batch 3: 1.4570658206939697
Epoch 28 Batch 1: 1.3400050401687622
Epoch 28 Batch 2: 1.4098151922225952
Epoch 28 Batch 3: 1.3065835237503052
Epoch 36 Batch 2: 0.23025216162204742
Epoch 36 Batch 3: 0.3020254671573639
Epoch 37 Batch 1: 0.2215476632118225
Epoch 37 Batch 2: 0.3165530562400818
Epoch 37 Batch 3: 0.15729258954524994
Epoch 38 Batch 1: 0.12012799829244614
Epoch 38 Batch 2: 0.185540109872818
Epoch 38 Batch 3: 0.027893034741282463
Epoch 39 Batch 1: 0.009235823526978493
Epoch 39 Batch 2: -0.009273972362279892
Epoch 39 Batch 3: -0.12110552191734314
Epoch 40 Batch 1: 0.1839933693408966
Epoch 40 Batch 2: -0.02105805091559887
Epoch 40 Batch 3: -0.03986062854528427
Epoch 41 Batch 1: -0.23694372177124023
Epoch 41 Batch 2: -0.20389395952224731
Epoch 41 Batch 3: -0.32341134548187256
Epoch 98 Batch 1: -2.595803737640381
Epoch 98 Batch 2: -2.6158416271209717
Epoch 98 Batch 3: -1.6498761177062988
Epoch 99 Batch 1: -1.4682114124298096
Epoch 99 Batch 2: -2.317016363143921
Epoch 99 Batch 3: -2.452997922897339

It all looks good to me at the start, and it appears to be learning, as the loss nicely decreases as epochs increase. If I leave it running, the loss looks like it plateaus around -5.5 somewhere around 850 epochs.

Thanks for any help.


Negative loss is totally fine, see also our blog post. cc @kashif


Great, thank you! My interpretation of the loss graphs from the mentioned blog post is that when the loss goes negative, lower loss is still what’s desired. In other words, a more negative value is better than a less negative value. Thus if the observed train loss is -5 and the observed validation loss is -6, it is considered an underfit condition. Is that correct?

Yes correct, the loss tends towards minus infinity, so the lower the better!

Yes, the loss is the negative log probability of the predicted distribution with respect to the ground truth, and as the distribution starts to fit the ground truth, the likelihood will tend to infinity and thus the loss will tend to negative infinity.
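A quick way to see why this is fine for a continuous output distribution (as opposed to the classification case): a probability *density* can exceed 1, so its log can be positive and the negative log-likelihood negative. A minimal illustration with torch.distributions:

```python
import torch
from torch.distributions import Normal

# A sharply peaked Normal: its density at the mean is
# 1 / (0.1 * sqrt(2 * pi)) ≈ 3.99, i.e. greater than 1,
# so the log-density there is positive.
dist = Normal(loc=0.0, scale=0.1)
log_prob = dist.log_prob(torch.tensor(0.0))  # ≈ +1.38
nll = -log_prob                              # ≈ -1.38: a negative "loss"
```

As the model’s predicted distribution tightens around the targets, densities like this one grow without bound, so the NLL keeps dropping below zero.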

How do the predictions and metrics look after training?

Right, if the validation loss is smaller than the training loss then it’s an underfitting scenario.

Not exactly great.

This is after ~1400 epochs and I haven’t seen the loss improve after the first few hundred. I let it train for so long with the hopes it might be a slow starter. I’m not that worried about it because I’m just playing around trying to figure things out. I didn’t expect useful results from this specific task.

Thanks for the info. I’m going to keep on learning. My next project is going to move into NLP.