`nan` training loss but eval loss does improve over time

I’ve been playing around with the XLSR-53 fine-tuning functionality but I keep getting a `nan` training loss.

The audio files I’m using are (see the loading sketch right after this list):

  • Down-sampled to 16kHz
  • Set to one channel only
  • Vary in length between 4 and 10 s
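
For context, here is a minimal sketch of that preprocessing using torchaudio; the function name and paths are illustrative, not taken from my actual pipeline:

```python
import torchaudio

TARGET_SR = 16_000  # target sampling rate expected by XLSR-53

def load_mono_16k(path):
    """Load an audio file, mix it down to one channel, and resample to 16 kHz."""
    waveform, sr = torchaudio.load(path)           # shape: (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)  # collapse to a single channel
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=TARGET_SR)
    return waveform.squeeze(0)  # 1-D float tensor at 16 kHz
```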

I’ve set the following hyper-params (a sketch of how they’re passed to the model appears after this list):

  • attention_dropout=0.1
  • hidden_dropout=0.1
  • feat_proj_dropout=0.0
  • mask_time_prob=0.05
  • layerdrop=0.1
  • learning rate:
    • on a warmup schedule to 3e-4 for 3 epochs
    • at 5e-4 for 3 epochs
    • back to 3e-4
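
For reference, a minimal sketch of how those dropout/masking values map onto the transformers API. The checkpoint name is an assumption, and the custom learning-rate schedule above isn’t reproduced here, since it needs a custom scheduler rather than a single `TrainingArguments` value:

```python
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",  # assumed checkpoint; adjust to the one actually used
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    # vocab_size / pad_token_id would normally come from the dataset's tokenizer
)
```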

Sadly, I’m fine-tuning the model on an unpublished corpus, so I’m probably not at liberty to upload it here, which might greatly hinder reproducibility efforts.

Here’s what the loss and WER progression looks like:

Anyone know what could be happening here? The model seems to be training just fine and some testing proves that the model performs well on the language I’m training it on. So what’s up with the training loss?

Pinging @patrickvonplaten and @valhalla as this might be relevant to them.

Hey @jjdv,

I’m sorry, but without a Google Colab it will be difficult for us to debug this. Given that your WER seems to decrease nicely, there might just be a problem with displaying the values… let’s see whether other people encounter the same problem.

Hey @patrickvonplaten!

I forgot to attach the notebook to my post. (I’m not fine-tuning on Colab, so feel free to just import the notebook there.)

Again, I’m not sure how useful it will be since the data isn’t publicly available (yet!).

Here’s the notebook!


I looked a bit into it and the problem is the following:

If one loss becomes nan or inf, all the following displayed losses also become nan or inf, since the shown loss is the average of all losses seen so far; see: transformers/trainer.py at 82b8d8c7b02562695f88be81cf0993972e324874 · huggingface/transformers · GitHub

However, this doesn’t mean that the losses after the nan is displayed are actually useless → the model can very well keep training. So it’s often more of a display error than an actual error. All in all, my best suggestion here is to just keep an eye on the validation loss and, if it goes down smoothly, continue training.
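
To make the display behaviour concrete, here is a tiny standalone illustration of a running average (not the Trainer code itself): once a single nan enters the sum, every later average that includes it is nan, even though the individual losses after it are perfectly fine.

```python
step_losses = [2.3, 1.8, float("nan"), 1.5, 1.2]  # one bad step among otherwise fine ones

for step in range(1, len(step_losses) + 1):
    running_avg = sum(step_losses[:step]) / step
    print(f"step {step}: displayed loss = {running_avg}")

# Steps 1-2 print finite values; from step 3 onward every displayed average
# is nan, even though the per-step losses at steps 4 and 5 are finite.
```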


Someone suggested adding this parameter in hopes of getting rid of this problem:

`ctc_zero_infinity=True`

The loss is going to be gigantic, but it does hold that every time I faced this issue the first training loss was Inf, so this is probably a good fix for it!
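
For anyone wondering where that flag goes: `ctc_zero_infinity` is a Wav2Vec2 config option that zeroes out infinite CTC losses (and their gradients) instead of letting them propagate as inf/nan. A sketch, with the same assumed checkpoint as above:

```python
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",  # assumed checkpoint, as in the earlier sketch
    ctc_zero_infinity=True,  # zero out infinite CTC losses instead of propagating inf/nan
    # ...the same dropout/masking kwargs as before...
)
```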

I have the same problem, but my eval_wer is also 1.0. At the beginning of training the eval_wer was 0.6 and 0.5, but after 19 epochs it reached 1.0 and is still 1.0 at epoch 33.