I am fine-tuning the wav2vec 2.0 large model on data from non-native English speakers. My data includes many unclear utterances, which are marked with special symbols () in my transcriptions. I would like to treat them as unknown words and train a garbage model. If anyone has advice, could you please share it?
I found that the unknown token is manually added to the vocabulary with the following line:
vocab_dict["[UNK]"] = len(vocab_dict)
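For context, in the tutorial the vocabulary is built from the unique characters of the cleaned transcriptions, and the special tokens are appended at the end. A minimal sketch of that flow (the transcription list is a placeholder, not real data):

```python
# Sketch of the tutorial's vocabulary construction (placeholder data).
transcriptions = ["hello world", "good morning"]

# Collect every unique character across the corpus.
chars = sorted(set("".join(transcriptions)))
vocab_dict = {c: i for i, c in enumerate(chars)}

# The CTC tokenizer uses "|" as the word delimiter instead of " ".
vocab_dict["|"] = vocab_dict.pop(" ")

# Special tokens are appended manually, as in the line above.
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
```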
However, it is unclear to me how it gets trained. In the above tutorial, all special symbols except letters and spaces are removed. Does the training data include any training cases for "[UNK]"? If yes, what is converted to "[UNK]"? If not, then how is "[UNK]" trained?
I found that if I assign a special symbol not used in my vocabulary list (e.g., @) to the unknown regions, it is converted to unk_token by Wav2Vec2CTCTokenizer.
Is this the correct way to handle unknown tokens? If anyone has experience with this, could you please share it?
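To illustrate the fallback behavior, any character missing from the vocabulary is mapped to the unk token's id at encoding time. This is a pure-Python sketch of that behavior, not the actual Wav2Vec2CTCTokenizer implementation:

```python
def encode(text, vocab_dict, unk_token="[UNK]"):
    """Map each character to its vocab id; unseen characters fall back to [UNK]."""
    unk_id = vocab_dict[unk_token]
    return [vocab_dict.get(ch, unk_id) for ch in text]

# Toy vocabulary for demonstration only.
vocab_dict = {"a": 0, "b": 1, "c": 2, "[UNK]": 3}
print(encode("ab@c", vocab_dict))  # "@" is not in the vocab, so it becomes id 3
# [0, 1, 3, 2]
```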
That’s a good question! If you want to treat them as unknown tokens, I think you can either follow your approach (replace them with a token that’s not in the vocab, like @) or add a new token to the vocabulary, maybe @ for noise.
But overall your approach sounds good and should work.
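A small sketch of the replacement approach, assuming the unclear-speech spans are annotated with parentheses like "(...)" in the transcripts (adjust the pattern to whatever marker your data actually uses):

```python
import re

def mark_unknown(transcript, marker="@"):
    """Replace parenthesized unclear-speech spans with a single placeholder
    character that does not otherwise appear in the vocabulary.
    Assumption: unclear speech is annotated as "(...)" in the transcripts."""
    return re.sub(r"\([^)]*\)", marker, transcript)

print(mark_unknown("i went to the (unclear) yesterday"))
# i went to the @ yesterday
```

Running this over the transcripts before building the vocabulary (and before tokenization) gives the CTC loss explicit training targets for the garbage symbol.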
@patrickvonplaten Thanks for your reply and advice. I also found your discussion with pcueng about Spanish ASR and handling out-of-vocabulary (non-Spanish) characters in the transcriptions. I am linking it here in case other people are interested. Thank you for the discussion.