Wav2Vec2 fine-tuned model's strange truncated predictions

What is your question?

I'm getting strange truncation of predictions at different steps of training. Can someone help me understand what the issue is?
At the first steps of training, around steps 800-1600 (2-3 epochs), I get predictions of valid length and word count but with low accuracy (which is OK at that stage). After step ~8000 things start getting strange: word-level accuracy gets better and WER accordingly gets lower, but the overall sentence length gets truncated, leaving only the right side of each utterance. For example:

Target:
Dərbəndin caxır-konyak kombinatı ərazisində yanğın qeydə alınıb. Hadisə axşam saatlarında baş verib. İlkin məlumata görə, insidentə spirt məhlulunun yerə dağılması səbəb olub

Prediction @ step 400 (length is correct, WER 60+)
dərbəndin çaxır kona kombinantı erazisində yanğın qeydə alınıb harisi axşam satlarında baş verb ilki məlumata görə insidentəs birt məxlunun yerə dağılması səbəb olub

Prediction @ step 800 (length is correct, WER 50+)
dərbəndin çaxırkonakombinanta ərazisində yanğın qeydə alınıb hadisə axşamsaatlarında baş verib ilki məlumata görə insidentəs birt məhlullunun yerə dağılması səbəb olub

Prediction @ step 1600 (length is getting truncated, words are joining together, WER 40+)
dərbədinçıki əazisdə ynğqdını hadişıa veiklumagörə insidentspirt məlun yerə dağılması səbəb olub

Prediction @ step > 20000 (around 30 to 100 epochs; almost no change in WER, sentences completely truncated down to their right part, WER stays around 16-27 depending on audio quality)

  1. ndəyaninsidentəspirtməluunun yerə dağılması səbəb olub
  2. insidntə spirt məhlulunun yerə dağılması səbəb olub
  3. insidentə spürt məhlulunun yerə dağılması səbəb olub
  4. nsientə spirt məhlulunun yerə dağılması səbəb olub

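For reference, the predictions above come from the tutorial's plain greedy CTC decode, nothing custom. A minimal sketch of my inference step (the checkpoint path and audio file are placeholders):

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Placeholder path; in reality this is my fine-tuned checkpoint.
processor = Wav2Vec2Processor.from_pretrained("path/to/checkpoint")
model = Wav2Vec2ForCTC.from_pretrained("path/to/checkpoint")

# Placeholder clip; expected to be 16 kHz mono audio.
speech, sampling_rate = sf.read("clip.wav")

inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy (argmax) decoding over the CTC outputs.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```
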
Code


Exactly the same code as in the tutorial, only with the epoch parameter varied (num_train_epochs from 30 to 100)… The relevant training arguments are sketched below.
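
For completeness, a sketch of the training arguments, with values as in the tutorial to the best of my recollection (the output directory is a placeholder; num_train_epochs is the only value I varied):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-az",  # placeholder output dir
    group_by_length=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=30,  # the only parameter I changed, from 30 up to 100
    fp16=True,
    save_steps=400,
    eval_steps=400,
    logging_steps=400,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
)
```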

What have you tried?

Training data: 30 hours of labeled audio, a single speaker per clip, around 15-30 seconds each

I used Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers to train on a language very similar to Turkish; it differs by only a few characters in the alphabet, so I used exactly the same params for the first training run. Then I removed return_attention_mask (see the sketch below), but nothing changed at all. I also tried fine-tuning the Turkish fine-tuned model from the tutorial itself, from Patrick's hub repo, and got the same results.
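
For clarity, the return_attention_mask change I mention was in the feature-extractor setup; a sketch of both variants, following the tutorial:

```python
from transformers import Wav2Vec2FeatureExtractor

# As in the tutorial: XLSR checkpoints were pretrained with an attention mask.
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,
)

# The variant I also tried: dropping return_attention_mask so it falls
# back to the default. The truncated predictions stayed the same.
feature_extractor_no_mask = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
)
```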

What’s your environment?

  • fairseq Version (e.g., 1.0 or main): current master branch
  • PyTorch Version (e.g., 1.0): the one which comes with Python 3.8
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): cloned and installed from source
  • Python version: 3.8
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: 1 x V100S (32 GB)

@patrickvonplaten, kindly asking you to shed some light on this issue. What could be the possible reasons?

Hey @BakuDev,

I don't think it'll be very easy to figure out what's going on here. My main question, however, is: if the sentence length is becoming too short, then why does the WER improve? In general, WER is quite a good metric for speech recognition, so if the WER is getting better, this is usually a good sign.
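
As a first sanity check, it might be worth computing the WER of one of the truncated predictions directly, e.g. with the jiwer package (a quick sketch, assuming WER is measured per utterance on lowercased text without punctuation, as in the tutorial's preprocessing). A hypothesis that drops most of the sentence should score far above the reported 16-27%:

```python
from jiwer import wer

target = (
    "dərbəndin caxır-konyak kombinatı ərazisində yanğın qeydə alınıb "
    "hadisə axşam saatlarında baş verib ilkin məlumata görə insidentə "
    "spirt məhlulunun yerə dağılması səbəb olub"
)
truncated = "insidntə spirt məhlulunun yerə dağılması səbəb olub"

# Every deleted word counts as an error, so heavy truncation should
# push the per-utterance WER well above the 16-27% range.
print(wer(target, truncated))
```

If this reports a much higher number than your training logs, I'd double-check how the evaluation WER is computed, e.g. whether references and predictions are aligned correctly.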