Hey @Rakib,
To answer 1.)
That depends on your use case. If you just want the best, “fair” model evaluated on DatasetA “test”, then I would train your LM on DatasetA “train”. If you want the most general model, then I’d try to use as much diverse language data as possible. As a reference, this blog post might help: Boosting Wav2Vec2 with n-grams in 🤗 Transformers
In general though, it’s not unusual to see such improvements, and as long as you don’t use the test transcripts in your LM training data, it should be fine!
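In case it helps, here is a minimal sketch of attaching such an n-gram to your processor with pyctcdecode, as in the blog above. The checkpoint name and the `5gram.arpa` path are placeholders for your own fine-tuned model and the LM you train on DatasetA “train”:

```python
from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM
from pyctcdecode import build_ctcdecoder

# placeholder: your fine-tuned Wav2Vec2 checkpoint
processor = AutoProcessor.from_pretrained("my-wav2vec2-model")

# labels must be ordered by token id for the CTC decoder
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab = [k for k, _ in sorted(vocab_dict.items(), key=lambda item: item[1])]

# beam-search decoder that rescores hypotheses with the n-gram LM
decoder = build_ctcdecoder(
    labels=sorted_vocab,
    kenlm_model_path="5gram.arpa",  # placeholder: LM trained on DatasetA "train" text only
)

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
processor_with_lm.save_pretrained("my-wav2vec2-model-with-lm")
```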
-
You could use a spelling corrector such as oliverguhr/spelling-correction-english-base · Hugging Face as a post-processor, among other options, and noise-cancellation filters as pre-processors.
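For the spelling corrector, something along these lines should work as a rough sketch (the example transcript is made up, and I’m assuming the Hub model is used via the text2text-generation pipeline):

```python
from transformers import pipeline

# load the spelling-correction model from the Hub
corrector = pipeline(
    "text2text-generation",
    model="oliverguhr/spelling-correction-english-base",
)

# placeholder ASR output with typical recognition errors
raw_transcript = "the quik brown focks jumps ofer the lasy dog"

# post-process the transcript
corrected = corrector(raw_transcript, max_length=128)[0]["generated_text"]
print(corrected)
```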
-
No, I wouldn’t recommend creating your own tokenizer. Instead, I’d just create a character-based lookup table for Wav2Vec2 as described here: Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers
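A rough sketch of what that lookup table looks like, following the blog (`train_transcripts` is a stand-in for your DatasetA “train” text column):

```python
import json
from transformers import Wav2Vec2CTCTokenizer

# placeholder: replace with the transcripts from DatasetA "train"
train_transcripts = ["hello world", "fine tune wav2vec2"]

# collect every character that appears in the training transcripts
chars = sorted(set("".join(train_transcripts)))
vocab = {c: i for i, c in enumerate(chars)}

# Wav2Vec2 conventions: "|" as word delimiter, plus unk/pad tokens
vocab["|"] = vocab.pop(" ")
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)

with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)
```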