Because I love the mindset of the community in the Wav2Vec2 sprint, I'd like to share some ideas about improving ASR accuracy and making it more stable for production.
I would be happy to discuss them.
In my experiments I tested many systems and algorithms, but one in particular reached amazing accuracy.
Once we have the transcribed text from the Wav2Vec2 model, there are many ways to correct it: either a dictionary search for each word that automatically uses the nearest result, or a seq2seq model. But what about a hybrid solution based on two or three parts?
Part 1: Token classification to recognize which words are wrong in context. Instead of tagging entities like person names or locations, we just classify each token as wrong or right.
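A minimal sketch of this part, under loud assumptions: a real implementation would fine-tune a token-classification model (e.g. via `AutoModelForTokenClassification` with the two labels RIGHT/WRONG); the function below only shows the hypothetical post-processing step that turns the classifier's per-token "wrong" probabilities into flagged positions:

```python
# Hypothetical post-processing for Part 1. The classifier itself is assumed
# to be a fine-tuned token-classification model with two labels (RIGHT/WRONG);
# here we only turn its per-token P(wrong) scores into suspicious indices.

def flag_wrong_tokens(tokens, wrong_probs, threshold=0.5):
    """Return indices of tokens whose P(wrong) exceeds the threshold."""
    return [i for i, p in enumerate(wrong_probs) if p > threshold]

tokens = ["the", "weether", "is", "nice"]
wrong_probs = [0.02, 0.93, 0.05, 0.01]  # would come from the classifier
print(flag_wrong_tokens(tokens, wrong_probs))  # [1]
```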
Part 2: Once we have the wrong tokens, we check a dictionary for similar alternatives, either using BM25 (tested) or DPR neural search (untested).
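To make the retrieval step concrete, here is a minimal pure-Python BM25 sketch that scores dictionary words by their character trigrams against a misrecognized token. The trigram trick and the tiny vocabulary are illustrative choices, not the original setup; in practice one would index a real lexicon with a library such as `rank_bm25` or Elasticsearch:

```python
import math
from collections import Counter

def trigrams(word):
    w = f"##{word}##"  # pad so short words still yield trigrams
    return [w[i:i + 3] for i in range(len(w) - 2)]

class BM25:
    """Minimal BM25 over character trigrams of dictionary words."""

    def __init__(self, vocab, k1=1.5, b=0.75):
        self.vocab = vocab
        self.k1, self.b = k1, b
        self.docs = [Counter(trigrams(w)) for w in vocab]
        self.avgdl = sum(sum(d.values()) for d in self.docs) / len(self.docs)
        df = Counter()
        for d in self.docs:
            df.update(d.keys())
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5))
                    for t, f in df.items()}

    def score(self, query, doc):
        dl = sum(doc.values())
        s = 0.0
        for t in set(trigrams(query)):
            if t not in doc:
                continue
            tf = doc[t]
            s += self.idf.get(t, 0.0) * tf * (self.k1 + 1) / (
                tf + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
        return s

    def top(self, query, n=3):
        scored = sorted(zip(self.vocab, (self.score(query, d) for d in self.docs)),
                        key=lambda x: -x[1])
        return scored[:n]

vocab = ["weather", "whether", "feather", "water", "nice"]
bm25 = BM25(vocab)
print(bm25.top("weether"))
```

The scores this produces feed directly into Part 3, where the best alternative is picked.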
Part 3: Once we have some alternatives for each token, we can either take the best-scored result or let a model trained on multiple choice decide. In my quick tests I simply used the best-scored alternative, but the multiple-choice variant definitely needs to be checked.
With these three steps:
- Token classification
- Dictionary search using BM25-like algorithms
- Replacing wrong tokens with the best-scored alternative
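The three steps above can be glued together roughly as below. Two loud simplifications so the sketch stays self-contained: step 1 uses plain dictionary membership as a stand-in for the trained token classifier, and step 2 uses `difflib`'s similarity ratio as a stand-in for BM25/DPR scoring:

```python
import difflib

# Toy dictionary; a real system would use a full lexicon or corpus vocabulary.
VOCAB = {"the", "weather", "is", "nice", "today"}

def correct(transcript):
    out = []
    for tok in transcript.split():
        # Step 1: flag the token as wrong (stand-in: not in the dictionary).
        if tok in VOCAB:
            out.append(tok)
            continue
        # Step 2: retrieve similar alternatives (stand-in for BM25/DPR).
        candidates = difflib.get_close_matches(tok, VOCAB, n=3, cutoff=0.6)
        # Step 3: replace with the best-scored alternative, if any.
        out.append(candidates[0] if candidates else tok)
    return " ".join(out)

print(correct("the weether is nise today"))  # the weather is nice today
```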
I reached amazing results, with a WER as low as 1.3%.
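For reference, WER here means the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words; a minimal sketch of the metric, assuming whitespace tokenization:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,            # deletion
                      d[j - 1] + 1,        # insertion
                      prev + (r != h))     # substitution / match
            prev, d[j] = d[j], cur
    return d[-1] / len(ref)

print(wer("the weather is nice", "the weather is nise"))  # 0.25
```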
At the moment my code is pretty messy, and I would like to start from scratch to build a clean library based on Hugging Face models, or maybe just a community notebook, depending on your feedback.
I'd like to hear what you think about it. Maybe you have a much better idea?
Maybe someone is interested in joining this research?