Boosting Wav2Vec2-XLS-R with an n-gram decoder using the transcripts used to train Wav2Vec2

Hey @Rakib,

To answer 1.)
That depends on your use case. If you just want the best, “fair” model evaluated on DatasetA’s “test” split, then I would train your LM on DatasetA’s “train” split. If you want the most general model, then I’d try to use as much diverse language data as possible. As a reference, this blog post might help: Boosting Wav2Vec2 with n-grams in 🤗 Transformers

In general, though, it’s not unusual to see such improvements, and as long as you don’t use the test transcripts in your LM training data, it should be fine!
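Here is a rough sketch of that workflow with KenLM and pyctcdecode, following the blog post above. It assumes KenLM’s `lmplz` binary is installed, and the dataset name (`dataset_a`), its `sentence` column, and the checkpoint ID are placeholders for your own setup:

```python
import subprocess

from datasets import load_dataset
from pyctcdecode import build_ctcdecoder
from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM

model_id = "your-username/wav2vec2-xls-r-finetuned"  # placeholder checkpoint

# 1. Dump the *train* transcripts (never the test ones!) into a text file.
train = load_dataset("dataset_a", split="train")  # placeholder dataset/column
with open("lm_text.txt", "w") as f:
    f.write(" ".join(train["sentence"]).lower())

# 2. Train a 5-gram LM with KenLM (same as `lmplz -o 5 <lm_text.txt >5gram.arpa`).
with open("lm_text.txt") as text, open("5gram.arpa", "w") as arpa:
    subprocess.run(["lmplz", "-o", "5"], stdin=text, stdout=arpa, check=True)

# 3. Build a CTC beam-search decoder over the tokenizer's vocabulary.
processor = AutoProcessor.from_pretrained(model_id)
sorted_vocab = sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])
decoder = build_ctcdecoder(
    labels=[token.lower() for token, _ in sorted_vocab],
    kenlm_model_path="5gram.arpa",
)

# 4. Wrap everything into a processor whose batch_decode() is LM-boosted.
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
```

Then `processor_with_lm.batch_decode(logits.numpy()).text` replaces the usual argmax decoding. Note that the blog post also patches the raw `.arpa` file to add the missing `</s>` end-of-sentence token before loading it, which pyctcdecode otherwise warns about.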

To answer 2.)
You could use a spelling corrector such as oliverguhr/spelling-correction-english-base · Hugging Face as a post-processor, among other options, and noise-cancelling filters as pre-processors.
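Here is roughly what that post-processing step could look like; I’m assuming the model can be run through a standard text2text pipeline, as its model card suggests:

```python
from transformers import pipeline

# Load the spelling-correction model as a seq2seq (text2text) pipeline.
fix_spelling = pipeline(
    "text2text-generation",
    model="oliverguhr/spelling-correction-english-base",
)

# Example: clean up a raw Wav2Vec2 transcript after decoding.
raw_transcript = "the quik brown fox jumpt over the lasy dog"
corrected = fix_spelling(raw_transcript, max_length=256)[0]["generated_text"]
print(corrected)
```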

To answer 3.)
No, I wouldn’t recommend creating your own tokenizer. Instead I’d just create a character-based lookup table for Wav2Vec2 as described here: Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers
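Condensed from that blog post, the lookup table is just every character in your training transcripts plus the special CTC tokens; the dataset name and `sentence` column are placeholders again:

```python
import json

from datasets import load_dataset
from transformers import Wav2Vec2CTCTokenizer

train = load_dataset("dataset_a", split="train")  # placeholder dataset

# Collect every character that appears in the training transcripts.
all_text = " ".join(train["sentence"]).lower()
vocab = {char: idx for idx, char in enumerate(sorted(set(all_text)))}

# Wav2Vec2 uses "|" as the word delimiter instead of a plain space,
# and needs [UNK] plus [PAD] (the CTC blank) tokens.
vocab["|"] = vocab.pop(" ")
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)

with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)
```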
