Wav2Vec2 XLSR: NaN train loss

I’m running into a NaN training_loss when training wav2vec2 XLSR on my custom dataset.
The weird thing is that even though training_loss goes to NaN, eval_loss keeps going down, and the error rates (CER and WER) go down as well.
I’ve experimented with a lower learning_rate, but I still get similar behavior. I’m logging with wandb.

My graphs look like the following:

There’s no value for train/loss after ~60 steps because it is NaN, but eval/loss is still decreasing.

Has anyone experienced similar behavior?

Update: I’ve let it train over the weekend. The train loss is still NaN, but the eval loss and both WER and CER continue to decrease.
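
One possible explanation, which is only an assumption and not something I have confirmed: if the logged train loss is derived from a running total of per-step losses, a single NaN step would turn every later logged value into NaN, even if the model itself keeps learning (for example, because that one update was skipped). The sketch below is plain Python with hypothetical names (`logged_losses`, `logging_steps`); it is not the trainer's actual code, just an illustration of the arithmetic.

```python
import math


def logged_losses(step_losses, logging_steps):
    """Simulate logging the mean loss per interval from a running total.

    Hypothetical helper for illustration only, not taken from any library.
    """
    total = 0.0        # running sum of all step losses so far
    last_logged = 0.0  # value of the running sum at the previous log
    logs = []
    for step, loss in enumerate(step_losses, start=1):
        total += loss
        if step % logging_steps == 0:
            # Mean loss over the interval, computed as a difference of totals.
            logs.append((total - last_logged) / logging_steps)
            last_logged = total
    return logs


# Made-up losses: one NaN step (the 3rd) among otherwise finite values.
losses = [4.0, 3.0, float("nan"), 2.0, 1.5, 1.0]
logs = logged_losses(losses, logging_steps=2)
print(logs)
```

Here the first logged value is finite, and everything after the NaN step is NaN because NaN propagates through the running total. If that is what is happening, finding the first step with a non-finite loss and inspecting that batch would be the next thing to check.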