I fine-tuned the WavLM-large model for ASR, following the "Fine-Tune Wav2Vec2 for English ASR with Transformers" tutorial, on Italian, French and Spanish data represented as phonemes. I implemented a custom rule-based tokenizer (based on Wav2Vec2PhonemeCTCTokenizer) that splits words into short subwords in this format:
in breve tempo il curioso battello > in | b r E ve | t E m po | il | ku r j o so | ba t t E llo |
So, basically, each token is separated by a space and each word ends with a pipe.
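For clarity, here is a minimal sketch of how that pipe/space format can be decoded back into word-level units (a hypothetical helper just to illustrate the format, not my actual tokenizer):

```python
def tokens_to_words(token_string):
    """Merge the space-separated tokens inside each pipe-delimited segment.

    Hypothetical decoding helper mirroring the format above; the real
    tokenizer (based on Wav2Vec2PhonemeCTCTokenizer) is rule-based and
    more involved.
    """
    return ["".join(seg.split()) for seg in token_string.split("|") if seg.strip()]

print(tokens_to_words("in | b r E ve | t E m po |"))
# ['in', 'brEve', 'tEmpo']
```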
To evaluate the model I'm using two metrics: the Phoneme Error Rate (PER), for which I get a fair score, and the Word Error Rate (WER), which on the other hand tends to be very high (~60 WER vs. ~9 PER).
Since the training data is represented in phonemes and I'm calculating the WER on the recognized words converted back into graphemes, I know that part of the errors come from ambiguities introduced by this conversion (e.g. two different graphemic words, like Italian "anno" and "hanno", that share the same phonetic transcription), but even after estimating the amount of these errors the WER is still too high.
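To show what I mean by the ambiguity, here is a toy sketch of inverting a phoneme lexicon (the entries are hypothetical, just for illustration):

```python
# Hypothetical phoneme lexicon; Italian 'anno' (year) and 'hanno'
# (they have) share the same pronunciation.
lexicon = {
    "anno": "a n n o",
    "hanno": "a n n o",
    "breve": "b r E ve",
}

# Inverting it for phoneme-to-grapheme conversion loses information:
inverse = {}
for word, phones in lexicon.items():
    inverse.setdefault(phones, []).append(word)

print(inverse["a n n o"])
# ['anno', 'hanno']  -> ambiguous, so any choice here can add WER errors
```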
Looking at the predictions, I see that word boundaries are often not recognized correctly:
reference = 'in | b r E ve | t E m po | il | ku r j o so | ba t t E llo |'
prediction = 'in | b r E ve t E m po | il | ku r j o so | ba t t E llo |'
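To get a feel for how much a single missed boundary costs each metric, here is a rough sketch on that exact pair (plain Levenshtein distance; a real evaluation would use a library like jiwer or evaluate):

```python
def edit_distance(ref, hyp):
    """Plain Levenshtein distance over two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution / match
        prev = cur
    return prev[-1]

reference = "in | b r E ve | t E m po | il | ku r j o so | ba t t E llo |"
prediction = "in | b r E ve t E m po | il | ku r j o so | ba t t E llo |"

# Phoneme level: score over the tokens, including the '|' boundary marker.
ref_ph, hyp_ph = reference.split(), prediction.split()
per = edit_distance(ref_ph, hyp_ph) / len(ref_ph)

# Word level: merge the tokens between pipes back into words, then score.
def to_words(s):
    return ["".join(seg.split()) for seg in s.split("|") if seg.strip()]

ref_w, hyp_w = to_words(reference), to_words(prediction)
wer = edit_distance(ref_w, hyp_w) / len(ref_w)

print(f"PER ~ {per:.2f}, WER ~ {wer:.2f}")
# PER ~ 0.04, WER ~ 0.33
```

So one dropped '|' is a single error out of 26 phoneme tokens, but it turns 'brEve' and 'tEmpo' into one wrong word ('brEvetEmpo'), i.e. a substitution plus a deletion out of only 6 words. That asymmetry alone would keep the PER low while inflating the WER.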
My suspicion is that this is the main cause of the high WER.
What I don't get is: why are word boundaries so often misrecognized, even though during training the model sees every word with its correct boundaries?
This issue also happens with pretty common words.
Thanks in advance for any comment!