Here are the results of decoding a part of an audio file containing the phrase “Waterloo station” with a wav → IPA model (facebook/wav2vec2-lv-60-espeak-cv-ft), depending on the length of the prefix and suffix included around it. The phrase itself is contained in the [6.3s → 7.8s] chunk of the audio. Other wav2vec2 models show similar behaviour: the decoding results vary depending on audio far away from the phrase. The results are the same for the pure model code without CTC and for the full Hugging Face pipeline with CTC.
START_TIME → END_TIME DECODED_TEXT
6.0 → 8.0 wɔːtɚluːsteɪʃən <<< the correct output
5.5 → 8.5 bɔtɚluːsteɪʃən
5.0 → 9.0 bɔːtɚluːsteɪʃən
4.5 → 9.5 wɔːtɚuːsteɪʃən
4.0 → 10.0 vɔtɚvuːsteɪʃən
3.5 → 10.5 wɔːtɾɚwuːsteɪʃən
3.0 → 11.0 wɔːtɾɚwuːsteɪʃən
2.5 → 11.5 wɔːtɚwuːsteɪʃən
2.0 → 12.0 wɔːtɾɚvuːsteɪʃən
1.5 → 12.5 wɔtɚvuːsteɪʃən
1.0 → 13.0 voːtɚvuːsteɪʃən
0.5 → 13.5 voːtɚvuːsteɪʃən
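For reference, the windows in the table can be reproduced with a sketch like the one below. It assumes 16 kHz mono audio already loaded as a 1-D float array; the helper names (expanding_windows, to_sample_range, decode_chunk) are hypothetical, and the decoding step is plain greedy argmax decoding, which may differ from the exact code used for the table.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # assumption: the checkpoint expects 16 kHz mono audio


def expanding_windows(start: float, end: float, step: float, count: int) -> List[Tuple[float, float]]:
    """Return `count` windows, each widened by `step` seconds on both sides."""
    return [(round(start - k * step, 3), round(end + k * step, 3)) for k in range(count)]


def to_sample_range(window: Tuple[float, float], sample_rate: int = SAMPLE_RATE) -> Tuple[int, int]:
    """Convert a (start, end) window in seconds to sample indices."""
    start_s, end_s = window
    return int(round(start_s * sample_rate)), int(round(end_s * sample_rate))


def decode_chunk(waveform, window, processor, model) -> str:
    """Greedy-decode one window of `waveform` (hypothetical helper).

    `processor` and `model` are assumed to be a Wav2Vec2Processor and a
    Wav2Vec2ForCTC loaded from the same checkpoint.
    """
    import torch  # imported lazily so the window helpers work without torch

    lo, hi = to_sample_range(window)
    inputs = processor(waveform[lo:hi], sampling_rate=SAMPLE_RATE, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]


# The 12 windows from the table: 6.0 → 8.0 up to 0.5 → 13.5
windows = expanding_windows(6.0, 8.0, 0.5, 12)
```

To run it, load the processor and model with Wav2Vec2Processor.from_pretrained and Wav2Vec2ForCTC.from_pretrained for the checkpoint named above, then call decode_chunk(waveform, w, processor, model) for each w in windows.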
What could be the reason for this? How can it be fixed?