ASR spell correction

flozi00 · March 25, 2021, 9:59pm

Because I love the mindset of the community around the Wav2Vec2 sprint, I’d like to share some ideas for improving ASR accuracy and making it more stable for production.
I would be happy to discuss them.

I have experimented with many systems and algorithms, and one in particular reached amazing accuracy.

Once we have the transcribed text from the Wav2Vec2 model, there are many ways to correct it: either a dictionary search for each word that automatically uses the nearest result, or a seq2seq model. But what about a hybrid solution based on two or three parts?
Part 1: Token classification, to recognize which words are wrong in context. Instead of labeling person names or locations, just classify each token as wrong or right.
Part 2: Once we have the wrong tokens, check a dictionary for similar alternatives, using either BM25 (tested) or DPR neural search (untested).
Part 3: Once we have some alternatives for each token, we can either use the best-scored result or let a model trained for multiple choice decide. In my quick tests I used the best alternative, but I definitely need to check the multiple-choice variant.

With these 3 steps

1. Token classification
2. Dictionary search using BM25-like algorithms
3. Replacing wrong tokens with the best-scored alternative

I reached amazing results, with a WER as low as 1.3%.
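
A minimal sketch of how the three steps could fit together, not the code behind the numbers above. It assumes a hand-written BM25 over character trigrams for the dictionary search, and it replaces the trained token classifier with a hypothetical `is_wrong` callable (here simply "not in the vocabulary"). The names `Bm25Dictionary`, `char_ngrams`, and `correct` are made up for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Split a word into padded character n-grams ('cat' -> '#ca', 'cat', 'at#')."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class Bm25Dictionary:
    """BM25 index where each dictionary word is a 'document' of character trigrams."""

    def __init__(self, words, k1=1.5, b=0.75):
        self.words = list(words)
        self.k1, self.b = k1, b
        self.docs = [Counter(char_ngrams(w)) for w in self.words]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        doc_freq = Counter(g for d in self.docs for g in d)
        n = len(self.words)
        # Smoothed IDF that stays positive even for very common n-grams.
        self.idf = {g: math.log(1 + (n - df + 0.5) / (df + 0.5))
                    for g, df in doc_freq.items()}

    def search(self, query, top_k=3):
        """Return up to top_k (word, score) pairs, best first."""
        grams = char_ngrams(query)
        scored = []
        for word, doc, length in zip(self.words, self.docs, self.lengths):
            score = 0.0
            for g in grams:
                tf = doc.get(g, 0)
                if tf:
                    norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
                    score += self.idf[g] * tf * (self.k1 + 1) / norm
            if score > 0:
                scored.append((word, score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


def correct(text, dictionary, is_wrong):
    """Step 1: flag tokens; step 2: search alternatives; step 3: take the best one."""
    out = []
    for token in text.split():
        if is_wrong(token):
            candidates = dictionary.search(token, top_k=1)
            out.append(candidates[0][0] if candidates else token)
        else:
            out.append(token)
    return " ".join(out)


vocab = ["touch", "the", "screen", "teach", "torch"]
index = Bm25Dictionary(vocab)
# Stand-in for the token classifier: anything outside the vocabulary is "wrong".
fixed = correct("tuch the screen", index, lambda t: t.lower() not in vocab)
print(fixed)
```

In a real setup `is_wrong` would be backed by a fine-tuned token classification model, and `search` could return several candidates for a multiple-choice model to pick from.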

At the moment my code is pretty messy, and I would like to start from scratch to build a clean library based on Hugging Face models, or maybe just a community notebook, depending on your feedback.

I’d like to hear what you think. Maybe you have a much better idea?
Maybe someone is interested in joining this research?

birgermoell · March 25, 2021, 10:20pm

Amazing idea. I would love this. Do you have any code I can check out?

flozi00 · March 25, 2021, 10:32pm

As described on Slack, I’m sorry, but I can’t share the code at the moment.

gchhablani · March 25, 2021, 10:59pm

Hi @flozi00

I was thinking about the exact same thing a few days ago. @voidful tried using GPT with this, and combined the probabilities using an element-wise product. This improved the performance by 10 points (WER). However, we discussed that we can use BART/T5/XLNet on top of this, or train a model to improve the results. I haven’t had the chance to try these yet.
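
To make the element-wise product concrete, here is a rough sketch for a single decoding step. It assumes both models already produce a distribution over the same aligned candidate list (how the alignment was actually done is not described here), and `combine_step` is a hypothetical helper name.

```python
def combine_step(am_probs, lm_probs):
    """Multiply two distributions over the same candidates and renormalize."""
    if len(am_probs) != len(lm_probs):
        raise ValueError("both distributions must cover the same candidates")
    product = [a * l for a, l in zip(am_probs, lm_probs)]
    total = sum(product)
    if total == 0:
        # The models fully disagree; fall back to the acoustic model alone.
        return list(am_probs)
    return [p / total for p in product]


candidates = ["touch", "tuch", "torch"]
am_probs = [0.40, 0.50, 0.10]   # acoustic model slightly prefers the misspelling
lm_probs = [0.60, 0.01, 0.39]   # language model finds "tuch" very unlikely
combined = combine_step(am_probs, lm_probs)
best = candidates[max(range(len(combined)), key=combined.__getitem__)]
print(best)
```

Multiplying probabilities is the same as adding log-probabilities with equal weight, so this is a special case of a weighted log-linear combination.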

I also thought about an end-to-end system, but that looks very tough to implement because the CTC loss needs to function properly. I think it is a very interesting avenue, and I’d love to explore it more. I’m definitely interested in helping build a library solely for language correction based on pre-trained Hugging Face models.

It would be really cool if we could fine-tune both models simultaneously. I’m not 100% sure, but the decoding used by Wav2Vec2 will break the computational graph, which will make backpropagation difficult.
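
To illustrate why the graph breaks, here is a plain-Python sketch of greedy CTC decoding (argmax per frame, collapse repeats, drop blanks), assuming that is the decoding in use. The vocabulary and the `ctc_greedy_decode` name are made up for the example.

```python
def ctc_greedy_decode(frame_probs, vocab, blank_id=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, remove blanks."""
    # argmax picks a discrete index per frame, so no gradient flows past this point.
    best_ids = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    chars = []
    prev = None
    for idx in best_ids:
        if idx != prev and idx != blank_id:
            chars.append(vocab[idx])
        prev = idx
    return "".join(chars)


vocab = ["_", "t", "u", "c", "h"]  # index 0 is the CTC blank


def peaked(i, size=5):
    """Toy frame distribution with most of the mass on index i."""
    return [0.9 if j == i else 0.025 for j in range(size)]


frames = [peaked(i) for i in [1, 1, 0, 2, 3, 3, 4]]
text = ctc_greedy_decode(frames, vocab)
print(text)
```

Both the argmax and the collapsing step produce discrete symbols, so a correction model that consumes the decoded text cannot pass gradients back to the acoustic model without some workaround.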

Hence, there can be two stages at which language correction is done:

Before the decoding: This means that the model will be trained end to end in some fashion, or we combine the CTC Loss on XLSR’s outputs and then on the next Language Model (some encoder-decoder) which learns to correct it at the same time. Then the decoding takes place, in case it is needed (which it will be most probably). This should be done on a character level LM.
After the decoding: This will use token/word-level LMs and the predictions from XLSR, in some encoder-decoder fashion.

We can test these cases and see which performs better.

gchhablani · March 25, 2021, 11:21pm

For token classification, however, there is one potential challenge: alignment. For example, you can’t always tell whether a token is fully correct or partially correct. Additionally, a misspelled word can be split into more or fewer tokens than the correct word.

Example: touch and tuch. Suppose these are tokenized into tou, #ch and t, #u, #ch. How will you classify right or wrong in this case?

joaoalvarenga · March 26, 2021, 12:07am

That’s actually a pretty good idea!

Are you familiar with shallow/deep fusion?

flozi00 · March 26, 2021, 7:06am

I haven’t had that problem yet, but maybe it could be solved by tokenizing on spaces.
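
A small sketch of what whitespace-level labels could look like, assuming a reference transcript is available for each ASR hypothesis. It uses `difflib.SequenceMatcher` from the standard library to align the two word sequences; `label_words` is a hypothetical helper, and words missing from the hypothesis are simply ignored here.

```python
from difflib import SequenceMatcher


def label_words(hypothesis, reference):
    """Label each whitespace-separated hypothesis word as 'right' or 'wrong'."""
    hyp, ref = hypothesis.split(), reference.split()
    matcher = SequenceMatcher(a=hyp, b=ref, autojunk=False)
    labels = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # 'equal' spans match the reference; 'replace' and 'delete' spans do not.
        # 'insert' spans contain no hypothesis words, so they add nothing here.
        label = "right" if tag == "equal" else "wrong"
        labels.extend((word, label) for word in hyp[i1:i2])
    return labels


labels = label_words("please tuch the screen", "please touch the screen")
print(labels)
```

Labeling whole words sidesteps the subword question: once a word is flagged, every subword piece of it can inherit the same label.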

flozi00 · March 26, 2021, 7:08am

Not really

flozi00 · March 26, 2021, 7:21am

My next step would be creating a repo and starting with dataset generation.
The dataset should be generated by the trained ASR model itself, so the correction model automatically learns the mistakes the transcription makes.
I think it would be pretty cool to provide multiple strategies, so that every idea gets implemented.
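
Roughly, the dataset generation could look like the sketch below. It assumes a labeled speech corpus of (audio, reference text) pairs and a hypothetical `transcribe` callable that wraps the trained ASR model; neither name comes from an existing library.

```python
def build_correction_pairs(samples, transcribe):
    """Run the ASR model over labeled audio and keep (hypothesis, reference) pairs.

    samples:    iterable of (audio, reference_text) tuples
    transcribe: hypothetical callable mapping audio to the ASR transcript
    """
    pairs = []
    for audio, reference in samples:
        hypothesis = transcribe(audio)
        pairs.append({
            "input": hypothesis,
            "target": reference,
            # Correct transcripts are kept so the model also learns to change nothing.
            "has_error": hypothesis != reference,
        })
    return pairs


# Toy stand-in for the ASR model: a lookup table keyed by a fake audio id.
fake_asr_output = {"clip1": "please tuch the screen", "clip2": "hello world"}
samples = [("clip1", "please touch the screen"), ("clip2", "hello world")]
pairs = build_correction_pairs(samples, fake_asr_output.get)
print(pairs)
```

The same pairs can feed a seq2seq corrector directly (input to target) or be turned into right/wrong word labels for token classification.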

gchhablani · March 26, 2021, 7:43am

What do you mean?

gchhablani · March 26, 2021, 7:44am

I’d love to collaborate.

Are you only thinking of English for now, since most models would be based on English?

We can also look into Character vs Word based models.

gchhablani · March 26, 2021, 7:46am

I’m familiar with shallow/deep fusion for multi-modal systems. Not sure how that applies here.

flozi00 · March 26, 2021, 9:50am

No, it should be multilingual

flozi00 · March 26, 2021, 12:07pm

Does anyone have experience with this?
The online demo looks good for our case.
I will start today with dataset generation for seq2seq with T5 and NeuSpell.
I will share the repo here later.
Do you want to move the conversation to Slack?

joaoalvarenga · March 26, 2021, 12:10pm

Shallow fusion is a very common technique in ASR. It basically combines an acoustic model such as Wav2Vec2 with a pretrained language model: you train the language model with the same vocabulary as the acoustic model, and at inference time you combine the outputs (logits) of the two models as follows:

am_y = p_am(y|x)
lm_y = p_lm(y)
y* = argmax_y (log am_y + λ log lm_y)

Using this you can improve the model performance a lot.
It’s the same idea we’re discussing here.

This paper is a good starting point: https://arxiv.org/pdf/1807.10857.pdf
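
As a rough illustration of the scoring rule, the sketch below rescores a short list of candidate transcripts with log am + λ log lm. This is a simplification: shallow fusion normally applies the combination at every beam search step, while this only rescores finished hypotheses. The toy unigram language model, its probabilities, and the function names are all made up for the example.

```python
import math


def unigram_log_prob(text, probs, unk=1e-4):
    """Toy language model: sum of per-word log-probabilities, with an unknown-word floor."""
    return sum(math.log(probs.get(word, unk)) for word in text.split())


def shallow_fusion_rescore(candidates, lm_log_prob, lam=0.3):
    """Pick the candidate maximizing log am + lam * log lm.

    candidates:  list of (text, acoustic_log_prob) pairs
    lm_log_prob: callable mapping text to its language model log-probability
    lam:         weight of the language model score
    """
    return max(candidates, key=lambda c: c[1] + lam * lm_log_prob(c[0]))[0]


TOY_PROBS = {"touch": 0.2, "the": 0.5, "screen": 0.2}


def toy_lm(text):
    return unigram_log_prob(text, TOY_PROBS)


candidates = [
    ("tuch the screen", math.log(0.5)),   # acoustic model prefers the misspelling
    ("touch the screen", math.log(0.4)),
]
best = shallow_fusion_rescore(candidates, toy_lm, lam=0.3)
print(best)
```

The weight λ is normally tuned on a development set: with λ = 0 the acoustic model decides alone, and larger values let the language model override it more often.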

gchhablani · March 26, 2021, 12:33pm

Please create a slack group

You can send me an invite @ chhablani.gunjan@gmail.com

gchhablani · March 26, 2021, 12:34pm

This means we’ll have to pretrain character-level/word-level LMs/Generative Models.

gchhablani · March 26, 2021, 12:36pm

I think @voidful suggested a similar thing. He takes a product of the probabilities from Wav2Vec2 and GPT-2 after aligning, and then uses decoding.

Not sure what deep fusion would mean here.

flozi00 · March 26, 2021, 2:29pm

https://join.slack.com/t/asr-transformers/shared_invite/zt-o6x1idmu-sSyU6oRDOzXgFCkSiwLQFg

flozi00 · March 26, 2021, 3:14pm

Could you provide some code for ?
