Inference of finetuned wav2vec2-xls-r-300m model using the ASR pipeline does not remove special tokens

I have finetuned a wav2vec2-xls-r-300m model on Hindi using the mozilla-foundation/common_voice_7_0 dataset. But when I run inference with the ASR pipeline, special tokens are not skipped. Even in the Hugging Face demo the special characters are retained.

Model: shivam/xls-r-hindi (shivam/xls-r-hindi · Hugging Face)
Another example of the same issue in a different model: Harveenchadha/vakyansh-wav2vec2-hindi-him-4200 (Harveenchadha/vakyansh-wav2vec2-hindi-him-4200 · Hugging Face)

The tokenizer of wav2vec2-xls-r-300m is "Wav2Vec2CTCTokenizer", and in the ASR pipeline, if "CTC" is present in the tokenizer class name, skip_special_tokens is set to False (transformers/automatic_speech_recognition.py at v4.15.0 · huggingface/transformers · GitHub). Because of this, special tokens are included in the output when using the ASR pipeline.
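A minimal sketch of that branch, paraphrased from the description above rather than copied from the pipeline source: the class-name check leaves skip_special_tokens set to False for CTC tokenizers, so tokens like <s> or [PAD] survive into the transcription if they are still present after CTC collapsing.

```python
# Sketch of the skip_special_tokens decision described for the ASR pipeline
# (paraphrased logic, not the literal transformers v4.15.0 source).
def decide_skip_special_tokens(tokenizer_class_name: str) -> bool:
    # CTC tokenizers must keep special tokens during decoding, because the
    # blank/[PAD] token is needed to collapse the CTC output correctly.
    return "CTC" not in tokenizer_class_name

skip_special_tokens = decide_skip_special_tokens("Wav2Vec2CTCTokenizer")
assert skip_special_tokens is False   # special tokens are NOT skipped for CTC
```

This is why any special token that the CTC collapsing step does not explicitly remove, such as a stray <s>, ends up verbatim in the pipeline output.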

We need to not skip special tokens for CTC (wav2vec2 in particular) because of the [PAD] token.

For instance, HELLO can only be transcribed correctly because the CTC token sequence is H, E, L, PAD, L, L, L, O, O: the PAD (blank) token separates the repeated Ls, so collapsing repeated tokens still yields both of them.

It seems here that maybe <s> got confused with <pad> and hence is not properly skipped during decoding. Could that be it? Also, I am unsure how it would behave with the Unicode scripts you are using (I hope that if we remove <s> by properly using the <pad> token, everything will work magically, but I can't be sure).

Found the issue. While building the vocabulary, we replace the space token " " with the pipe symbol ("|") as the word delimiter. But if the dataset already contains the pipe symbol, which is common in Hindi because Hindi uses the similar-looking character "।" (danda) as a full stop, we end up with two pipe entries colliding in the vocabulary. This messes up the subsequent step of adding special tokens to the dictionary. Specifically, the bos token "<s>" and the pad token "[PAD]" get the same index in the vocabulary. This causes the index of the [PAD] token to be replaced by <s>, so the output of the Wav2Vec2ForCTC model contains the "<s>" token.
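The collision can be reproduced with a toy alphabet (the characters below are made up for illustration; the real logic lives in the run_speech_recognition training script). Once "|" is already in the alphabet, the delimiter swap overwrites its entry, orphans an id, and the special tokens appended via len(vocab) land on ids that are already taken:

```python
# Illustrative reproduction of the vocabulary-build bug with a toy alphabet.
# The dataset text already contains "|" (often typed in place of the danda "।").
alphabet = sorted({" ", "|", "क", "ख"})
vocab = {ch: i for i, ch in enumerate(alphabet)}
# {" ": 0, "|": 1, "क": 2, "ख": 3}

# The script swaps the word delimiter from space to pipe:
vocab["|"] = vocab[" "]   # silently overwrites the existing "|" entry
del vocab[" "]            # -> {"|": 0, "क": 2, "ख": 3}; id 1 is now orphaned

# Special tokens are appended using len(vocab), which no longer matches the
# maximum id, so a fresh token collides with an existing one:
vocab["[UNK]"] = len(vocab)   # 3 -- same id as "ख"
vocab["[PAD]"] = len(vocab)   # 4

assert len(set(vocab.values())) < len(vocab)   # two tokens share one id
```

With duplicate ids in the vocabulary file, the id-to-token decode table silently drops one of the colliding tokens, and in the reported model this chain left <s> on the id the CTC head uses for [PAD]. The fix is to handle a pre-existing "|" before the swap, as in the commit linked below.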

Fix in the training script: fix: Vocabulary build issue in run_speech_recognition.py · shivamm7/transformers@59635cb · GitHub