Wav2vec2 results vary depending on the length of a far-away prefix

koder-ua · September 30, 2023, 10:45am

Here are the results of decoding part of an audio file containing the text “Waterloo station” with a wav → IPA model (facebook/wav2vec2-lv-60-espeak-cv-ft), depending on the length of the prefix and suffix around it. The text itself is contained in the [6.3s → 7.8s] chunk of the audio. Other wav2vec2 models show similar behaviour: the decoding results vary depending on far-away prefixes. The results are the same for the pure model code without CTC and for the full Hugging Face pipeline with CTC.

START_TIME → END_TIME DECODED_TEXT
6.0 → 8.0 wɔːtɚluːsteɪʃən <<< the correct output
5.5 → 8.5 bɔtɚluːsteɪʃən
5.0 → 9.0 bɔːtɚluːsteɪʃən
4.5 → 9.5 wɔːtɚuːsteɪʃən
4.0 → 10.0 vɔtɚvuːsteɪʃən
3.5 → 10.5 wɔːtɾɚwuːsteɪʃən
3.0 → 11.0 wɔːtɾɚwuːsteɪʃən
2.5 → 11.5 wɔːtɚwuːsteɪʃən
2.0 → 12.0 wɔːtɾɚvuːsteɪʃən
1.5 → 12.5 wɔtɚvuːsteɪʃən
1.0 → 13.0 voːtɚvuːsteɪʃən
0.5 → 13.5 voːtɚvuːsteɪʃən
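
For anyone who wants to reproduce the table, here is a minimal sketch of the experiment, not the exact script used above. It assumes a 16 kHz mono waveform already loaded as a NumPy array; the helper names (`symmetric_windows`, `crop`, `decode_windows`) are made up for illustration. The model calls follow the usage shown on the model card, and the tokenizer for this checkpoint needs the `phonemizer` package installed.

```python
import numpy as np

SAMPLE_RATE = 16_000  # wav2vec2 checkpoints expect 16 kHz mono audio


def symmetric_windows(first=(6.0, 8.0), step=0.5, count=12):
    """Return (start, end) pairs in seconds, widening `first` by `step` per side."""
    start, end = first
    return [(start - i * step, end + i * step) for i in range(count)]


def crop(waveform, start_s, end_s, sample_rate=SAMPLE_RATE):
    """Cut the [start_s, end_s) interval out of a 1-D waveform."""
    lo = int(round(start_s * sample_rate))
    hi = int(round(end_s * sample_rate))
    return waveform[lo:hi]


def decode_windows(waveform, model_id="facebook/wav2vec2-lv-60-espeak-cv-ft"):
    """Greedy-decode every window; downloads the model on first use."""
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id).eval()

    results = []
    for start, end in symmetric_windows():
        chunk = crop(waveform, start, end)
        inputs = processor(chunk, sampling_rate=SAMPLE_RATE, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        ids = torch.argmax(logits, dim=-1)  # per-frame argmax, no language model
        results.append((start, end, processor.batch_decode(ids)[0]))
    return results
```

Calling `decode_windows(waveform)` returns one `(start, end, text)` tuple per row of the table, from 6.0 → 8.0 out to 0.5 → 13.5.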

What could be the reason? How can I fix this?
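
One thing that may be worth checking, offered as a possible contributor rather than a confirmed cause: the wav2vec2 feature extractor can apply zero-mean, unit-variance normalization over the whole input it is given, so the same samples reach the model with different values depending on what else is in the window. The sketch below uses synthetic noise, not the audio from this post, and the louder passage at the start is a hypothetical stand-in for a far-away prefix. It only illustrates the input scaling; it does not measure how much the decoded text changes.

```python
import numpy as np

SR = 16_000


def normalize(x, eps=1e-7):
    """Zero-mean, unit-variance scaling computed over the whole input."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)


# Synthetic 14 s signal; the first 5 s are made louder on purpose (assumption).
rng = np.random.default_rng(0)
audio = rng.normal(scale=0.1, size=14 * SR)
audio[: 5 * SR] *= 5.0


def target_after_norm(start_s, end_s, target=(6.3, 7.8)):
    """Normalize the [start_s, end_s) window, then return the target span inside it."""
    chunk = normalize(audio[int(start_s * SR): int(end_s * SR)])
    lo = int(round((target[0] - start_s) * SR))
    hi = int(round((target[1] - start_s) * SR))
    return chunk[lo:hi]


narrow = target_after_norm(6.0, 8.0)   # window that excludes the loud prefix
wide = target_after_norm(0.5, 13.5)    # window that includes the loud prefix

# Same raw samples, different values after normalization.
max_diff = np.abs(narrow - wide).max()
print(f"max abs difference on the target span: {max_diff:.3f}")
```

If the difference is large for the real audio too, one experiment would be to normalize with statistics from a fixed reference window and compare the decodings. The transformer layers also attend over the whole input, so normalization alone may not explain the variation.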
