How to use unk_token (unknown token) during wav2vec model finetuning

I am fine-tuning the wav2vec large model on data from non-native English speakers. My data includes many stretches of unclear speech, which are marked with special symbols () in my transcriptions. I would like to treat them as unknown words and train a garbage model. If anyone has advice, could you please share it?

I tried @patrickvonplaten’s tutorial (thanks a lot :smile: ), but I am still unclear about how to treat them.

I found that the unknown token is manually added to the vocabulary by the following line:
vocab_dict["[UNK]"] = len(vocab_dict)
However, I am unclear about how it is trained. In the above tutorial, all special symbols other than letters and the space are removed. Does the training data then include any training cases for "[UNK]"? If so, what is converted to "[UNK]"? If not, how is "[UNK]" trained?
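For context, here is a minimal sketch of the vocabulary construction in the tutorial (the transcripts below are toy stand-ins for real data). It shows why no training example ever produces "[UNK]" by default: the vocab is built only from characters that actually occur in the cleaned transcripts, and "[UNK]" is appended afterwards.

```python
# Toy transcripts standing in for a real dataset.
transcripts = ["hello world", "good morning"]

# Collect every character that occurs in the cleaned transcripts.
chars = sorted(set("".join(transcripts)))
vocab_dict = {c: i for i, c in enumerate(chars)}

# The tutorial replaces " " with "|" as the word-delimiter token.
vocab_dict["|"] = vocab_dict.pop(" ")

# [UNK] and [PAD] are appended manually; since the cleaning step already
# removed everything outside the vocab, no training text maps to [UNK].
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
```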

I found that if I assign a special symbol not used in my vocabulary (e.g., @) to the unclear areas, it is converted to unk_token by Wav2Vec2CTCTokenizer.
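A simplified simulation of that behavior (a toy vocab and lookup function, not the real Wav2Vec2CTCTokenizer class): the tokenizer encodes transcripts character by character, and any character missing from the vocab falls back to the unk_token id, so a placeholder like "@" ends up as [UNK] in the labels.

```python
# Toy character vocab; the real one is built from the training transcripts.
vocab = {"h": 0, "e": 1, "l": 2, "o": 3, "|": 4, "[UNK]": 5, "[PAD]": 6}

def encode(text, vocab, unk_token="[UNK]"):
    """Map each character to its vocab id, falling back to unk_token for
    out-of-vocabulary characters (spaces become the "|" delimiter)."""
    return [vocab.get(ch, vocab[unk_token]) for ch in text.replace(" ", "|")]

ids = encode("hello @", vocab)  # "@" is not in the vocab -> [UNK] id
```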

Is this the correct way to treat unknown tokens? If anyone has experience with this, could you please share it?

Hey @Su-Youn ,

That’s a good question! If you want to treat them as unknown tokens, I think you can either follow your approach (replace them with a token that’s not in the vocab, like @) or add a new token to the vocabulary, e.g. @ for noise.
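A minimal sketch of that second option, under assumptions: "(unclear)" stands in for the original marker (which did not survive forum formatting), and the cleaning regex is illustrative, not the tutorial’s code. The idea is to map unclear regions to "@" during preprocessing and build the vocab from the cleaned text, so "@" becomes an ordinary entry and the CTC head learns a dedicated noise class instead of falling through to [UNK].

```python
import re

def clean_transcript(text):
    """Lowercase, collapse the (assumed) unclear-speech marker to '@',
    and drop all other non-letter symbols."""
    text = text.lower().replace("(unclear)", "@")  # assumed marker format
    return re.sub(r"[^a-z@' ]", "", text).strip()

# Toy transcripts with unclear regions marked.
transcripts = [clean_transcript(t)
               for t in ["Hello (unclear) world!", "Good (unclear)"]]

# "@" is now part of the cleaned text, so it gets its own vocab entry
# (and its own CTC output unit) like any other character.
vocab_dict = {c: i for i, c in enumerate(sorted(set("".join(transcripts))))}
```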

But overall your approach sounds good and should work.

@patrickvonplaten Thanks for your reply and advice. I also found your discussion with pcueng about handling out-of-vocabulary (non-Spanish) characters in the transcriptions for Spanish ASR. I am linking it here in case other people are interested. Thank you for the discussions.