I fine-tuned the WavLM-large model for ASR, following the "Fine-Tune Wav2Vec2 for English ASR with Transformers" tutorial, on Italian, French and Spanish data represented as phonemes. I implemented a custom rule-based tokenizer (based on Wav2Vec2PhonemeCTCTokenizer) that splits words into short subwords in this format:
in breve tempo il curioso battello > in | b r E ve | t E m po | il | ku r j o so | ba t t E llo |
So, basically, each token is separated by a space and each word ends with a pipe.
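For clarity, here is a minimal sketch of how that pipe/space format can be decoded back into word-level units (a hypothetical helper just to illustrate the format, not my actual tokenizer):

```python
def tokens_to_words(token_string):
    """Merge the space-separated tokens inside each pipe-delimited segment.

    Hypothetical decoding helper mirroring the format above; the real
    tokenizer (based on Wav2Vec2PhonemeCTCTokenizer) is rule-based and
    more involved.
    """
    return ["".join(seg.split()) for seg in token_string.split("|") if seg.strip()]

print(tokens_to_words("in | b r E ve | t E m po |"))
# ['in', 'brEve', 'tEmpo']
```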
To evaluate the model I'm using two metrics: the Phoneme Error Rate (PER), for which I get a fair score, and the Word Error Rate (WER), which on the other hand tends to be very high (~60 WER vs. ~9 PER).
Since the training data is represented in phonemes and I'm calculating the WER on the recognized words converted back into graphemes, I know that part of the errors come from ambiguities introduced by this conversion (e.g. two different graphemic words, like Italian "anno" and "hanno", that share the same phonetic transcription), but even after estimating the amount of these errors the WER is still too high.
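To show what I mean by the ambiguity, here is a toy sketch of inverting a phoneme lexicon (the entries are hypothetical, just for illustration):

```python
# Hypothetical phoneme lexicon; Italian 'anno' (year) and 'hanno'
# (they have) share the same pronunciation.
lexicon = {
    "anno": "a n n o",
    "hanno": "a n n o",
    "breve": "b r E ve",
}

# Inverting it for phoneme-to-grapheme conversion loses information:
inverse = {}
for word, phones in lexicon.items():
    inverse.setdefault(phones, []).append(word)

print(inverse["a n n o"])
# ['anno', 'hanno']  -> ambiguous, so any choice here can add WER errors
```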
Looking at the predictions, I see that word boundaries are often not recognized correctly:
reference = 'in | b r E ve | t E m po | il | ku r j o so | ba t t E llo |'
prediction = 'in | b r E ve t E m po | il | ku r j o so | ba t t E llo |'
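To get a feel for how much a single missed boundary costs each metric, here is a rough sketch on that exact pair (plain Levenshtein distance; a real evaluation would use a library like jiwer or evaluate):

```python
def edit_distance(ref, hyp):
    """Plain Levenshtein distance over two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution / match
        prev = cur
    return prev[-1]

reference = "in | b r E ve | t E m po | il | ku r j o so | ba t t E llo |"
prediction = "in | b r E ve t E m po | il | ku r j o so | ba t t E llo |"

# Phoneme level: score over the tokens, including the '|' boundary marker.
ref_ph, hyp_ph = reference.split(), prediction.split()
per = edit_distance(ref_ph, hyp_ph) / len(ref_ph)

# Word level: merge the tokens between pipes back into words, then score.
def to_words(s):
    return ["".join(seg.split()) for seg in s.split("|") if seg.strip()]

ref_w, hyp_w = to_words(reference), to_words(prediction)
wer = edit_distance(ref_w, hyp_w) / len(ref_w)

print(f"PER ~ {per:.2f}, WER ~ {wer:.2f}")
# PER ~ 0.04, WER ~ 0.33
```

So one dropped '|' is a single error out of 26 phoneme tokens, but it turns 'brEve' and 'tEmpo' into one wrong word ('brEvetEmpo'), i.e. a substitution plus a deletion out of only 6 words. That asymmetry alone would keep the PER low while inflating the WER.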
My suspicion is that this is the main cause of the high WER.
What I don't get is: why are word boundaries so often misrecognized, even though during training the model sees every word with its correct boundaries?
This issue also happens with pretty common words.
Thanks in advance for any comment!