Boosting Wav2Vec2-xls-r with an n-gram decoder using the transcripts used to train Wav2Vec2

I would be really thankful if you could answer the following queries. I have been finetuning wav2vec2 and then boosting the performance with an n-gram decoder.

  1. If the wav2vec2 model is trained on DatasetA, is it wise to train an n-gram LM on the transcripts of DatasetA to boost the performance of the wav2vec2 model? I have tried this and, as expected, it reduces WER and CER significantly. However, will the model be less general or perform worse on unseen data? Or is it better to train the n-gram LM on text data similar to DatasetA but not exactly DatasetA?
  2. What are some preprocessing and postprocessing modules that I can use to improve the performance of the wav2vec2 model? Can you please point me to some resources?
  3. Will I get any benefit if I create my own tokenizer for my own dataset? Will it increase the performance of my model? Can I finetune pretrained models using this tokenizer, or do I have to train from scratch?


Hey @Rakib,

To answer 1.)
That depends on your use case. If you just want to have the best, “fair” model evaluated on DatasetA “test” then I would train your LM on DatasetA “train”. If you want the most general model then I’d try to use as much diverse language data as possible. As a reference maybe this blog might help: Boosting Wav2Vec2 with n-grams in 🤗 Transformers

In general though it’s not unusual to see such improvements and as long as you don’t use the test transcripts in your LM training data, it should be fine!
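
To make the "no test transcripts in the LM training data" point concrete, here is a minimal sketch of how the LM corpus could be assembled. The function names (`normalize`, `build_lm_corpus`) and the normalization rules are my own assumptions for illustration, not something taken from the blog post. Dropping any line that also occurs in the test set is a deliberately conservative choice.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation (except apostrophes), collapse whitespace."""
    text = re.sub(r"[^\w\s']", "", text.lower())
    return " ".join(text.split())

def build_lm_corpus(train_transcripts, test_transcripts, extra_text=()):
    """Hypothetical helper: collect LM training lines from the train
    transcripts plus optional extra text, skipping every line that also
    appears (after normalization) in the test transcripts."""
    held_out = {normalize(t) for t in test_transcripts}
    corpus = []
    for line in list(train_transcripts) + list(extra_text):
        norm = normalize(line)
        if norm and norm not in held_out:
            corpus.append(norm)
    return corpus

# Toy data, made up for the example.
corpus = build_lm_corpus(
    train_transcripts=["Hello world.", "The cat sat on the mat"],
    test_transcripts=["the cat sat on the mat"],
    extra_text=["A sentence from some other text source"],
)
print(corpus)
```

The resulting lines can be written one per line to a text file and used to train the n-gram, for example with KenLM as in the blog post linked above. Passing more varied text through `extra_text` is one way to push towards the more general model.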

To answer 2.)
For postprocessing you could use, among other things, a spelling corrector such as oliverguhr/spelling-correction-english-base · Hugging Face. For preprocessing you could use noise-cancellation filters.

To answer 3.)
No, I wouldn’t recommend creating your own tokenizer. Instead I’d just create a character-based lookup table for Wav2Vec2 as described here: Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers. That post fine-tunes the pretrained checkpoint with such a vocabulary, so you don’t have to train from scratch.
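
The character-based lookup table mentioned above can be sketched roughly as follows. This is a simplified version of the idea in the linked post, not a copy of it. The helper name `build_char_vocab` is hypothetical, and the sketch assumes the transcripts are already normalized and do not themselves contain the `|` character.

```python
def build_char_vocab(transcripts):
    """Map every character seen in the transcripts to an integer id.

    The space is replaced by "|" as a visible word delimiter, and
    "[UNK]" and "[PAD]" are appended as special tokens.
    """
    chars = sorted(set("".join(transcripts)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    if " " in vocab:
        # Reuse the id of the space for the word delimiter token.
        vocab["|"] = vocab.pop(" ")
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab

# Toy transcripts, made up for the example.
vocab = build_char_vocab(["hello world", "good day"])
print(vocab)
```

The dictionary can then be saved as JSON and used as the vocabulary file for the CTC tokenizer, as the post describes.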
