Wav2Vec2 fine-tuned model's strange truncated predictions

What is your question?

I'm getting strange truncation of predictions at different steps of training. Can someone help me understand what the issue is?
At the first steps of training, around steps 800-1600 (2-3 epochs), I get predictions of valid length and word count but with low accuracy (which is OK at that stage). After step ~8000 things start getting strange: word-level accuracy gets better and WER accordingly gets lower, but the overall sentence length gets truncated, leaving only the right side of each utterance. For example:

Target:
Dərbəndin caxır-konyak kombinatı ərazisində yanğın qeydə alınıb. Hadisə axşam saatlarında baş verib. İlkin məlumata görə, insidentə spirt məhlulunun yerə dağılması səbəb olub

Prediction @ step 400 (length is correct, WER 60+)
dərbəndin çaxır kona kombinantı erazisində yanğın qeydə alınıb harisi axşam satlarında baş verb ilki məlumata görə insidentəs birt məxlunun yerə dağılması səbəb olub

Prediction @ step 800 (length is correct, WER 50+)
dərbəndin çaxırkonakombinanta ərazisində yanğın qeydə alınıb hadisə axşamsaatlarında baş verib ilki məlumata görə insidentəs birt məhlullunun yerə dağılması səbəb olub

Prediction @ step 1600 (length is getting truncated, words are joining together, WER 40+)
dərbədinçıki əazisdə ynğqdını hadişıa veiklumagörə insidentspirt məlun yerə dağılması səbəb olub

Prediction @ step > 20000 (around 30 to 100 epochs; almost no change in WER, sentences completely truncated down to their right part, WER stays around 16-27 depending on audio quality)

  1. ndəyaninsidentəspirtməluunun yerə dağılması səbəb olub
  2. insidntə spirt məhlulunun yerə dağılması səbəb olub
  3. insidentə spürt məhlulunun yerə dağılması səbəb olub
  4. nsientə spirt məhlulunun yerə dağılması səbəb olub

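For reference, the predictions above come from the tutorial's plain greedy CTC decode, nothing custom. A minimal sketch of my inference step (the checkpoint path and audio file are placeholders):

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Placeholder path; in reality this is my fine-tuned checkpoint.
processor = Wav2Vec2Processor.from_pretrained("path/to/checkpoint")
model = Wav2Vec2ForCTC.from_pretrained("path/to/checkpoint")

# Placeholder clip; expected to be 16 kHz mono audio.
speech, sampling_rate = sf.read("clip.wav")

inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy (argmax) decoding over the CTC outputs.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```
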
Code


Exactly the same code as in the tutorial, only with the epoch parameter varied (num_train_epochs from 30 to 100)… The relevant training arguments are sketched below.
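
For completeness, a sketch of the training arguments, with values as in the tutorial to the best of my recollection (the output directory is a placeholder; num_train_epochs is the only value I varied):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-az",  # placeholder output dir
    group_by_length=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=30,  # the only parameter I changed, from 30 up to 100
    fp16=True,
    save_steps=400,
    eval_steps=400,
    logging_steps=400,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
)
```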

What have you tried?

Training data: 30 hours of labeled audio, a single speaker per clip, around 15-30 seconds each

I used Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers to train on a language very similar to Turkish; it differs by only a few characters in the alphabet, so I used exactly the same params for the first training run. Then I removed return_attention_mask (see the sketch below), but nothing changed at all. I also tried fine-tuning the Turkish fine-tuned model from the tutorial itself, from Patrick's hub repo, and got the same results.
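
For clarity, the return_attention_mask change I mention was in the feature-extractor setup; a sketch of both variants, following the tutorial:

```python
from transformers import Wav2Vec2FeatureExtractor

# As in the tutorial: XLSR checkpoints were pretrained with an attention mask.
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,
)

# The variant I also tried: dropping return_attention_mask so it falls
# back to the default. The truncated predictions stayed the same.
feature_extractor_no_mask = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
)
```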

What’s your environment?

  • fairseq Version (e.g., 1.0 or main): current master branch
  • PyTorch Version (e.g., 1.0): the one which comes with Python 3.8
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): cloned and installed from source
  • Python version: 3.8
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: 1 x V100S (32 GB)

@patrickvonplaten, kindly asking you to shed some light on this issue. What could be the possible reasons?

Hey @BakuDev,

I don't think it'll be very easy to figure out what's going on here. My main question, however, is: if the sentence length is becoming too short, then why does the WER improve? In general, WER is quite a good metric for speech recognition, so if the WER is getting better, this is usually a good sign.
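
As a first sanity check, it might be worth computing the WER of one of the truncated predictions directly, e.g. with the jiwer package (a quick sketch, assuming WER is measured per utterance on lowercased text without punctuation, as in the tutorial's preprocessing). A hypothesis that drops most of the sentence should score far above the reported 16-27%:

```python
from jiwer import wer

target = (
    "dərbəndin caxır-konyak kombinatı ərazisində yanğın qeydə alınıb "
    "hadisə axşam saatlarında baş verib ilkin məlumata görə insidentə "
    "spirt məhlulunun yerə dağılması səbəb olub"
)
truncated = "insidntə spirt məhlulunun yerə dağılması səbəb olub"

# Every deleted word counts as an error, so heavy truncation should
# push the per-utterance WER well above the 16-27% range.
print(wer(target, truncated))
```

If this reports a much higher number than your training logs, I'd double-check how the evaluation WER is computed, e.g. whether references and predictions are aligned correctly.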