Because I love the mindset of the community in the Wav2Vec2 sprint, I'd like to share some ideas about improving the accuracy of ASR and making it more stable for production.
I would be happy to discuss them.
In some experiments I tested many systems and algorithms, and one in particular reached amazing accuracy.
Once we have the transcribed text from the Wav2Vec2 model, there are many ways to correct it: either a dictionary search for each word that automatically uses the nearest result, or a seq2seq model. But what about a hybrid solution based on two or three parts?
Part 1: Token classification to recognize which words are wrong in context. Instead of classifying person names or locations, just classify wrong or right.
Part 2: Once we have the wrong tokens, check a dictionary for similar alternatives, either using BM25 (tested) or DPR neural search (untested).
Part 3: Once we have some alternatives for each token, we can either use the best-scored result or let a model trained for multiple choice decide. In my quick tests I went with the best-scored alternative, but I definitely need to check the multiple-choice variant.
With these 3 steps:
- Token classification
- Dictionary search using BM25-like algorithms
- Replacing wrong tokens with the best-scored alternative
I reached amazing results, down to a WER of 1.3%.
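As a rough illustration of parts 2 and 3, here is a minimal sketch, assuming a plain word list as the dictionary and the `rank_bm25` package for the BM25 search. The part-1 token classifier is stubbed out with a simple dictionary lookup, and candidate selection just keeps the best-scored hit; all names and the example sentence are made up.

```python
from rank_bm25 import BM25Okapi

# Hypothetical word list standing in for a real lexicon.
dictionary = ["a", "is", "the", "touch", "house", "mouse",
              "speech", "recognition", "problem"]

# Index every entry as a bag of characters so near-misses still retrieve it.
bm25 = BM25Okapi([list(word) for word in dictionary])

def is_wrong(token: str) -> bool:
    # Placeholder for the part-1 token classifier (wrong vs. right in context).
    return token not in dictionary

def correct(transcript: str) -> str:
    corrected = []
    for token in transcript.split():
        if is_wrong(token):
            # Part 2: retrieve similar dictionary entries with character-level BM25.
            scores = bm25.get_scores(list(token))
            # Part 3: keep the best-scored alternative.
            token = dictionary[scores.argmax()]
        corrected.append(token)
    return " ".join(corrected)

print(correct("speech recogniton is a tuch problem"))
```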
At the moment my code is pretty messy, and I would like to start from scratch to build a clean library based on huggingface models, or maybe just a community notebook; it depends on your feedback.
I'd like to hear what you think about it. Maybe you have a much better idea?
Maybe someone is interested in joining this research?
Amazing idea. I would love this. Do you have any code I can check out?
As described on Slack, I'm sorry that I can't share this code at the moment.
Hi @flozi00
I was thinking about the exact same thing a few days ago. @voidful tried using GPT with this, and combined the probabilities using an element-wise product. This improved the performance by 10 points (WER). However, we discussed that we can use BART/T5/XLNet on top of this, or train a model to improve the results. I haven’t had the chance to try these yet.
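A toy sketch of that element-wise product, assuming the two models have already been aligned so they emit per-step probabilities over the same vocabulary (the alignment between Wav2Vec2's character outputs and GPT's subword tokens is the hard part and is not shown here):

```python
import torch

def fuse(am_probs: torch.Tensor, lm_probs: torch.Tensor) -> torch.Tensor:
    # Element-wise product of the two distributions, renormalised per step.
    fused = am_probs * lm_probs
    return fused / fused.sum(dim=-1, keepdim=True)

# Stand-in tensors with shape (time steps, shared vocab size).
am_probs = torch.softmax(torch.randn(50, 32), dim=-1)
lm_probs = torch.softmax(torch.randn(50, 32), dim=-1)
best_ids = fuse(am_probs, lm_probs).argmax(dim=-1)  # greedy pick per step
```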
I also thought about an end-to-end system, but that looks very tough to implement because the CTC loss needs to keep functioning properly. I think it is a very interesting avenue, and I'd love to explore it more. I'm definitely interested in helping build a library solely for language correction based on pre-trained huggingface models.
It would be really cool if we could fine-tune both models simultaneously. I'm not 100% sure, but the decoding used by Wav2Vec2 will break the computational graph, and it will be difficult to perform any backpropagation.
Hence, there can be two stages at which language correction is done:
- Before the decoding: The model is trained end to end in some fashion, or we combine the CTC loss on XLSR's outputs with a following language model (some encoder-decoder) that learns to correct them at the same time. Then the decoding takes place, if it is needed (which it most probably will be). This should be done with a character-level LM.
- After the decoding: This will use token/word-level LMs and the predictions from XLSR, in some encoder-decoder fashion (see the sketch after this list).
We can test these cases and see which performs better.
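For the "after the decoding" case, a minimal sketch of what that could look like with a seq2seq model; `my-org/t5-asr-correction` is a hypothetical fine-tuned checkpoint, not an existing model, and the transcript is made up.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "my-org/t5-asr-correction"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

transcript = "i wont to go tuch the see"  # raw decoded output from the acoustic model
inputs = tokenizer("correct: " + transcript, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```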
For token classification, however, there is one potential challenge: alignment. For example, you can't always tell whether a token is fully correct or only partially correct. Additionally, a word can tokenize into more or fewer tokens than its correct counterpart.
Example: touch and tuch. Suppose these are tokenized into tou, #ch and t, #u, #ch respectively. How will you classify right or wrong in this case?
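A quick way to see the mismatch (the exact splits depend on the tokenizer's vocabulary, so the outputs noted in the comments are only indicative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("touch"))  # likely a single token, e.g. ['touch']
print(tokenizer.tokenize("tuch"))   # likely split into subwords, e.g. ['tu', '##ch']
```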
That’s actually a pretty good idea!
Are you familiar with shallow/deep fusion?
I haven't had that problem yet, but maybe it could be solved by tokenizing on spaces.
My next step would be creating a repo and starting with dataset generation.
The dataset should be generated by the trained ASR model itself, so the correction model automatically learns the mistakes the transcription makes.
I think it would be pretty cool to provide multiple strategies, so every idea gets covered.
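A sketch of that generation loop, assuming a LibriSpeech-style dataset with `audio` and `text` columns and a Wav2Vec2 checkpoint; the model and dataset names are illustrative placeholders.

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Replace with your own fine-tuned checkpoint and dataset.
model_name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

ds = load_dataset("librispeech_asr", "clean", split="validation[:100]")

pairs = []  # (noisy hypothesis, clean reference) training pairs for the corrector
for sample in ds:
    inputs = processor(sample["audio"]["array"],
                       sampling_rate=sample["audio"]["sampling_rate"],
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    hypothesis = processor.batch_decode(logits.argmax(dim=-1))[0]
    pairs.append({"input": hypothesis, "target": sample["text"]})
```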
I'd love to collaborate.
Are you only thinking of English for now? Since most models would be based on English.
We can also look into character-based vs. word-based models.
I’m familiar with shallow/deep fusion for multi-modal systems. Not sure how that applies here.
No, it should be multilingual
Does anyone have experience with this?
The online demo looks good for our case.
I will start today with dataset generation for seq2seq with T5 and NeuSpell.
Sharing the repo here later.
Do you want to move communication to Slack?
Shallow fusion is a very common technique in ASR. It basically combines an acoustic model such as wav2vec2 with a pretrained language model: you train the LM with the same vocabulary as the acoustic model, and at inference time you combine the outputs of the two models by:
am_y = p_am(y | x)    (acoustic model)
lm_y = p_lm(y)        (language model)
y* = argmax_y ( log am_y + λ · log lm_y )
Using this, you can improve the model performance a lot.
It’s the same idea we’re discussing here.
This paper is a good beginning: https://arxiv.org/pdf/1807.10857.pdf
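The same argmax formula can also be applied as a simple n-best rescoring step rather than per-frame logit fusion; here is a toy sketch of that variant with GPT-2 as the LM. The hypotheses and their acoustic log-probabilities are made up, and λ would need tuning.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")

def lm_score(text: str) -> float:
    # Average log-likelihood per token under GPT-2 (a crude LM score; proper
    # length handling and vocabulary matching are omitted in this toy).
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean negative log-likelihood
    return -loss.item()

# Made-up n-best hypotheses with made-up acoustic log-probabilities.
hypotheses = {
    "i want to touch the sea": -1.8,
    "i wont to tuch the see": -1.5,
}
lam = 0.5
best = max(hypotheses, key=lambda h: hypotheses[h] + lam * lm_score(h))
print(best)
```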
Please create a Slack group.
You can send me an invite @ chhablani.gunjan@gmail.com
This means we’ll have to pretrain character-level/word-level LMs/Generative Models.
I think @voidful suggested a similar thing. He takes a product of the probabilities from Wav2Vec2 and GPT-2 after aligning, and then uses decoding.
Not sure what deep fusion would mean here.
Could you provide some code for that?