Inference of finetuned wav2vec2-xls-r-300m model using the ASR pipeline does not remove special tokens

I have finetuned a wav2vec2-xls-r-300m model on Hindi using the mozilla-foundation/common_voice_7_0 dataset. But when I run inference with the ASR pipeline, special tokens are not skipped. Even in the Hugging Face demo the special characters are retained.

Model: shivam/xls-r-hindi (shivam/xls-r-hindi · Hugging Face)
Another example of the same issue in a different model: Harveenchadha/vakyansh-wav2vec2-hindi-him-4200 (Harveenchadha/vakyansh-wav2vec2-hindi-him-4200 · Hugging Face)

The tokenizer of wav2vec2-xls-r-300m is "Wav2Vec2CTCTokenizer", and in the ASR pipeline, if "CTC" is present in the tokenizer class name, skip_special_tokens is set to False (transformers/automatic_speech_recognition.py at v4.15.0 · huggingface/transformers · GitHub). Because of this, special tokens are included in the output when using the ASR pipeline.
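A minimal sketch of that branch, paraphrased from the description above rather than copied from the pipeline source: the class-name check leaves skip_special_tokens set to False for CTC tokenizers, so tokens like <s> or [PAD] survive into the transcription if they are still present after CTC collapsing.

```python
# Sketch of the skip_special_tokens decision described for the ASR pipeline
# (paraphrased logic, not the literal transformers v4.15.0 source).
def decide_skip_special_tokens(tokenizer_class_name: str) -> bool:
    # CTC tokenizers must keep special tokens during decoding, because the
    # blank/[PAD] token is needed to collapse the CTC output correctly.
    return "CTC" not in tokenizer_class_name

skip_special_tokens = decide_skip_special_tokens("Wav2Vec2CTCTokenizer")
assert skip_special_tokens is False   # special tokens are NOT skipped for CTC
```

This is why any special token that the CTC collapsing step does not explicitly remove, such as a stray <s>, ends up verbatim in the pipeline output.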

We need to not skip special tokens for CTC (wav2vec2 in particular) because of the [PAD] token.

For instance, HELLO can only be transcribed correctly because the CTC token sequence is H, E, L, PAD, L, L, L, O, O: the PAD (blank) token separates the repeated Ls, so collapsing repeated tokens still yields both of them.

It seems here that maybe <s> got confused with <pad> and hence is not properly skipped during decoding. Could that be it? Also, I am unsure how it would behave with the Unicode scripts you are using (I hope that if we remove <s> by properly using the <pad> token, everything will work magically, but I can't be sure).

Found the issue. While building the vocabulary, we replace the space token " " with the pipe symbol ("|") as the word delimiter. But if the dataset already contains the pipe symbol, which is common in Hindi because Hindi uses the similar-looking character "।" (danda) as a full stop, we end up with two pipe entries colliding in the vocabulary. This messes up the subsequent step of adding special tokens to the dictionary. Specifically, the bos token "<s>" and the pad token "[PAD]" get the same index in the vocabulary. This causes the index of the [PAD] token to be replaced by <s>, so the output of the Wav2Vec2ForCTC model contains the "<s>" token.
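The collision can be reproduced with a toy alphabet (the characters below are made up for illustration; the real logic lives in the run_speech_recognition training script). Once "|" is already in the alphabet, the delimiter swap overwrites its entry, orphans an id, and the special tokens appended via len(vocab) land on ids that are already taken:

```python
# Illustrative reproduction of the vocabulary-build bug with a toy alphabet.
# The dataset text already contains "|" (often typed in place of the danda "।").
alphabet = sorted({" ", "|", "क", "ख"})
vocab = {ch: i for i, ch in enumerate(alphabet)}
# {" ": 0, "|": 1, "क": 2, "ख": 3}

# The script swaps the word delimiter from space to pipe:
vocab["|"] = vocab[" "]   # silently overwrites the existing "|" entry
del vocab[" "]            # -> {"|": 0, "क": 2, "ख": 3}; id 1 is now orphaned

# Special tokens are appended using len(vocab), which no longer matches the
# maximum id, so a fresh token collides with an existing one:
vocab["[UNK]"] = len(vocab)   # 3 -- same id as "ख"
vocab["[PAD]"] = len(vocab)   # 4

assert len(set(vocab.values())) < len(vocab)   # two tokens share one id
```

With duplicate ids in the vocabulary file, the id-to-token decode table silently drops one of the colliding tokens, and in the reported model this chain left <s> on the id the CTC head uses for [PAD]. The fix is to handle a pre-existing "|" before the swap, as in the commit linked below.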

Fix in the training script: fix: Vocabulary build issue in run_speech_recognition.py · shivamm7/transformers@59635cb · GitHub