I am fine-tuning the wav2vec large model on data from non-native English speakers. My data includes many unclear speech segments, which are marked with special symbols in my transcriptions. I would like to treat them as unknown words and train a garbage model. If anyone has advice, could you please share it?
I tried @patrickvonplaten’s tutorial (thanks a lot!), but I am still unclear about how to treat these symbols.
I found that the unknown token is manually added to the vocabulary by the following line:
vocab_dict["[UNK]"] = len(vocab_dict)
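For context, here is a rough sketch of how the vocabulary is built in that tutorial (function and variable names here are illustrative, not copied from the notebook):

```python
def build_vocab(transcripts):
    """Collect every character that appears in the cleaned transcripts."""
    all_chars = set()
    for text in transcripts:
        all_chars.update(text)
    vocab_dict = {ch: idx for idx, ch in enumerate(sorted(all_chars))}
    # The space is usually replaced by "|" as the word delimiter.
    vocab_dict["|"] = vocab_dict.pop(" ", len(vocab_dict))
    # Special tokens are appended manually; note that neither of them
    # ever appears in the raw transcript text itself.
    vocab_dict["[UNK]"] = len(vocab_dict)
    vocab_dict["[PAD]"] = len(vocab_dict)
    return vocab_dict

vocab = build_vocab(["hello world", "good morning"])
```

So "[UNK]" gets an ID in the vocabulary, but nothing in the cleaned training text maps to it, which is exactly what confuses me.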
However, I am unclear about how it is trained. In the tutorial, all special symbols except letters and spaces are removed during preprocessing. Does the training data then include any training cases for "[UNK]"? If yes, what gets converted to "[UNK]"? If not, how is "[UNK]" trained at all?
I found that if I assign a special symbol not used in my vocabulary list (e.g., @) to the unclear areas, it is converted to the unk_token by Wav2Vec2CTCTokenizer.
Is this the correct way to treat unknown tokens? If anyone has experience with this, could you please share it?
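To make the approach concrete, here is a minimal sketch of what I mean. The `((...))` marker pattern is just an assumption for illustration (my actual marker symbol did not survive the forum formatting), and `encode` is not the real Wav2Vec2CTCTokenizer, only a mock of its fallback behavior when a character is missing from the vocabulary:

```python
import re

def mark_unclear(text, marker_pattern=r"\(\(.*?\)\)"):
    """Replace each unclear-speech region with "@", a character
    deliberately NOT present in the vocabulary."""
    return re.sub(marker_pattern, "@", text)

def encode(text, vocab, unk_token="[UNK]"):
    """Mock of the tokenizer fallback: any character absent from the
    vocab is mapped to the [UNK] id, mirroring what Wav2Vec2CTCTokenizer
    does when constructed with unk_token="[UNK]"."""
    return [vocab.get(ch, vocab[unk_token]) for ch in text.replace(" ", "|")]

vocab = {"|": 0, "a": 1, "b": 2, "c": 3, "[UNK]": 4, "[PAD]": 5}
cleaned = mark_unclear("a ((xxx)) b")  # -> "a @ b"
ids = encode(cleaned, vocab)           # "@" falls back to the [UNK] id
```

With this setup the model would actually see labeled "[UNK]" targets during training, which is what I am hoping achieves the garbage-model effect.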