How to use unk_token (unknown token) during wav2vec model finetuning

Su-Youn · May 16, 2022, 10:46am

I am finetuning the wav2vec large model on non-native English speakers’ data. My data includes a lot of unclear speech, which is marked with special symbols () in my transcriptions. I would like to treat these segments as unknown words and train a garbage model. If anyone has advice, could you please share it?

I tried @patrickvonplaten’s tutorial (thanks a lot), but I am still unclear about how to treat them.

I found that the unknown token is manually added to the vocabulary by the following line:
vocab_dict["[UNK]"] = len(vocab_dict)
However, I am unclear about how it is trained. In the above tutorial, all special symbols other than letters and spaces are removed. Does the training data include any training cases for “[UNK]”? If yes, what is converted to “[UNK]”? If not, then how is “[UNK]” trained?
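To make the question concrete, here is a minimal sketch of a character vocabulary built the way the post describes, with [UNK] appended by hand. The transcripts are toy data, and the helper `out_of_vocab_chars` is a hypothetical name introduced only for illustration. It suggests that when preprocessing strips every character outside the vocabulary, no label in the training text ends up pointing at [UNK].

```python
# Toy transcriptions standing in for the real training text (hypothetical data).
transcripts = ["hello world", "good morning"]

# Collect every distinct character that appears in the transcriptions.
chars = sorted(set("".join(transcripts)))
vocab_dict = {ch: idx for idx, ch in enumerate(chars)}

# Use "|" as the word delimiter in place of the space character.
vocab_dict["|"] = vocab_dict.pop(" ")

# Special tokens are appended by hand; no transcript character maps to them.
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)


def out_of_vocab_chars(text, vocab):
    """Return the characters of `text` that have no entry in `vocab`."""
    return sorted({ch for ch in text.replace(" ", "|") if ch not in vocab})


# A cleaned transcript has nothing that would fall back to [UNK] ...
clean = out_of_vocab_chars("hello world", vocab_dict)
# ... while a transcript that keeps a marker such as "@" does.
marked = out_of_vocab_chars("hello @ world", vocab_dict)
```

Running a check like this over the real transcriptions is one way to see whether any training label would be converted to [UNK] at all.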

I found that if I assign a special symbol that is not in my vocabulary (e.g., @) to the unknown regions, Wav2Vec2CTCTokenizer converts it to unk_token.

Is this the correct way to treat unknown tokens? If anyone has experience with this, could you please share it?
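The observed behavior can be sketched with a simplified stand-in for character-level lookup with an unknown-token fallback. This is not the library's code: `encode` and the tiny vocabulary are assumptions made for illustration, and the real Wav2Vec2CTCTokenizer should be checked directly on your own vocab.json.

```python
def encode(text, vocab, unk_token="[UNK]", word_delimiter="|"):
    """Map each character to its id, falling back to the unk id if it is missing."""
    unk_id = vocab[unk_token]
    # Spaces become the word delimiter before the per-character lookup.
    return [vocab.get(ch, unk_id) for ch in text.replace(" ", word_delimiter)]


# Tiny hypothetical vocabulary; "@" is deliberately left out.
vocab = {"|": 0, "a": 1, "b": 2, "[UNK]": 3, "[PAD]": 4}

# "@" has no entry, so it is encoded with the id of [UNK].
ids = encode("ab @ a", vocab)
```

With this fallback, every "@" in a transcript becomes a training label for [UNK], so the audio aligned with those regions is what the model would learn to emit [UNK] for.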

patrickvonplaten · May 18, 2022, 9:36pm

Hey @Su-Youn ,

That’s a good question! If you want to treat them as unknown tokens, I think you can either do your approach (replace them with a token that’s not in the vocab like @) or add a new token to the vocabulary, maybe @ for noise.

But overall your approach sounds good and should work.
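The two options can be compared in a short sketch. The marker string `<unclear>` is a placeholder, since the thread does not show the actual symbol used in the transcriptions, and `mark_unclear` and `label_id` are hypothetical helper names.

```python
UNCLEAR_MARKER = "<unclear>"  # placeholder; the real marker is not shown in the thread
NOISE_SYMBOL = "@"


def mark_unclear(text):
    """Replace the transcription marker with a single-character symbol."""
    return text.replace(UNCLEAR_MARKER, NOISE_SYMBOL)


def label_id(ch, vocab):
    """Look up a character, falling back to the [UNK] id when it is missing."""
    return vocab.get(ch, vocab["[UNK]"])


base_vocab = {"|": 0, "a": 1, "b": 2, "[UNK]": 3, "[PAD]": 4}

# Option 1: keep "@" out of the vocabulary, so it shares the [UNK] id.
unk_label = label_id(NOISE_SYMBOL, base_vocab)

# Option 2: give "@" its own entry, so noise gets a dedicated id.
vocab_with_noise = dict(base_vocab)
vocab_with_noise[NOISE_SYMBOL] = len(vocab_with_noise)
noise_label = label_id(NOISE_SYMBOL, vocab_with_noise)

cleaned = mark_unclear("ab <unclear> a")
```

With the second option the vocabulary grows by one entry, so the size of the CTC output layer has to match the new vocabulary length when the model is created.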

Su-Youn · May 19, 2022, 3:13am

@patrickvonplaten Thanks for your reply and advice. I also found your discussion with pcueng about Spanish ASR and handling out-of-vocabulary (non-Spanish) characters in the transcriptions. I linked it in case other people are interested. Thank you for the discussion.

