Spanish ASR: Fine-Tuning Wav2Vec2

Hey @pcuenq,

Very good question, and it's great that you already took a look at the dataset. I think you have the following three options here:

  1. Remove all characters that clearly don't belong to the Spanish language from both the training and the test data.

  2. Don't remove those characters from the training and test data, but remove them from the vocabulary. In this scenario, make sure to add the "[UNK]" token to the vocab and define it as unk_token="[UNK]" when instantiating the tokenizer. This way the model will simply learn to map all such characters to [UNK], but you don't have to significantly change the training data (see the sketch after this list).

  3. Just add all such tokens to the vocab.
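
For option 2), here is a minimal sketch of what building the vocab and tokenizer could look like. The character set, the toy `sentences` list, and the special-token choices are illustrative assumptions, not part of your actual setup:

```python
import json
from transformers import Wav2Vec2CTCTokenizer

# Characters treated as part of the Spanish alphabet (an assumption for illustration).
spanish_chars = set("abcdefghijklmnopqrstuvwxyzáéíóúüñ ")

# Toy stand-in for the real transcripts (replace with your dataset's text column).
sentences = ["Hola, ¿qué tal?", "Das ist Beispieltext"]

# Collect all characters that occur in the transcripts, keep only the Spanish ones.
all_chars = set("".join(sentences).lower())
vocab = {c: i for i, c in enumerate(sorted(all_chars & spanish_chars))}

# CTC tokenizers usually represent the space as a word delimiter token such as "|".
vocab["|"] = vocab.pop(" ")

# Add the special tokens; every character outside the vocab will map to [UNK].
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)

with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)
```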

Obviously, all methods have their advantages and disadvantages. I would lean toward option 2) or 1), with option 2) probably being my preferred choice.

The reason is that removing all those characters might change the meaning of a sentence, so that the remaining part no longer makes much sense. Teaching the model to simply classify unknown sounds (symbols of another language) as an unknown symbol (the "[UNK]" token representing the foreign character) makes the most sense here IMO. It also shouldn't really affect your final WER, since in the most likely scenario the model wouldn't have classified those symbols correctly anyway.

Option 1) is also very much possible; you could even fully remove the affected data samples. That should be feasible for a high-resource language like Spanish.
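
As a rough sketch of option 1), the cleanup could look like the following. The "sentence" column name matches Common Voice, and the kept character set is again an assumption; adjust both for your dataset:

```python
import re
from datasets import load_dataset

# Keep only lowercase Spanish letters and spaces (the allowed set is an assumption).
drop_pattern = re.compile(r"[^abcdefghijklmnopqrstuvwxyzáéíóúüñ ]")

def remove_foreign_chars(batch):
    # Lowercase first, then strip every character outside the allowed set.
    batch["sentence"] = drop_pattern.sub("", batch["sentence"].lower())
    return batch

dataset = load_dataset("common_voice", "es", split="train")
dataset = dataset.map(remove_foreign_chars)
```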

Option 3) has the big disadvantage of making training slower and less stable. It really only makes sense if each of the added characters occurs a significant number of times in your training data, which I believe is not the case here.
