Spanish ASR: Fine-Tuning Wav2Vec2

Hey @pcuenq,

Very good question, and it's great that you already took a look at the dataset. I think you have the following three options here:

  1. Remove all characters that clearly don't belong to the Spanish language from both the training and the test data.

  2. Don't remove those characters from the training and test data, but remove them from the vocabulary. In this scenario, make sure to add the "[UNK]" token to the vocab and define it as unk_token="[UNK]" when instantiating the tokenizer. This way the model will simply learn to map all such characters to [UNK], but you don't have to significantly change the training data (see the sketch after this list).

  3. Just add all such tokens to the vocab.
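
For option 2), here is a minimal sketch of what building the vocab and tokenizer could look like. The character set, the toy `sentences` list, and the special-token choices are illustrative assumptions, not part of your actual setup:

```python
import json
from transformers import Wav2Vec2CTCTokenizer

# Characters treated as part of the Spanish alphabet (an assumption for illustration).
spanish_chars = set("abcdefghijklmnopqrstuvwxyzáéíóúüñ ")

# Toy stand-in for the real transcripts (replace with your dataset's text column).
sentences = ["Hola, ¿qué tal?", "Das ist Beispieltext"]

# Collect all characters that occur in the transcripts, keep only the Spanish ones.
all_chars = set("".join(sentences).lower())
vocab = {c: i for i, c in enumerate(sorted(all_chars & spanish_chars))}

# CTC tokenizers usually represent the space as a word delimiter token such as "|".
vocab["|"] = vocab.pop(" ")

# Add the special tokens; every character outside the vocab will map to [UNK].
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)

with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)
```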

Obviously, all methods have their advantages and disadvantages. I would lean toward option 2) or 1), with option 2) probably being my preferred choice.

The reason is that removing all those characters might change the meaning of a sentence, so that the remaining part no longer makes much sense. Teaching the model to simply classify unknown sounds (symbols of another language) as an unknown symbol (the "[UNK]" token representing the foreign character) makes the most sense here IMO. It also shouldn't really affect your final WER, since in the most likely scenario the model wouldn't have classified those symbols correctly anyway.

Option 1) is also very much possible; you could even fully remove the affected data samples. That should be feasible for a high-resource language like Spanish.
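
As a rough sketch of option 1), the cleanup could look like the following. The "sentence" column name matches Common Voice, and the kept character set is again an assumption; adjust both for your dataset:

```python
import re
from datasets import load_dataset

# Keep only lowercase Spanish letters and spaces (the allowed set is an assumption).
drop_pattern = re.compile(r"[^abcdefghijklmnopqrstuvwxyzáéíóúüñ ]")

def remove_foreign_chars(batch):
    # Lowercase first, then strip every character outside the allowed set.
    batch["sentence"] = drop_pattern.sub("", batch["sentence"].lower())
    return batch

dataset = load_dataset("common_voice", "es", split="train")
dataset = dataset.map(remove_foreign_chars)
```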

Option 3) has the big disadvantage of making training slower and less stable. It really only makes sense if each of the added characters occurs a significant number of times in your training data, which I believe is not the case here.
