ASR spell correction

Amazing idea. I would love this. Do you have any code I can check out?

As I mentioned on Slack, I’m sorry, but I can’t share the code at the moment.

Hi @flozi00

I was thinking about the exact same thing a few days ago. @voidful tried using GPT with this, and combined the probabilities using an element-wise product. This improved the performance by 10 points (WER). However, we discussed that we can use BART/T5/XLNet on top of this, or train a model to improve the results. I haven’t had the chance to try these yet.
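To make the element-wise product concrete, here is a minimal sketch. It assumes both models have already been aligned so that they score the same set of candidates. The function name `fuse_by_product` and all the numbers are made up for illustration; this is not @voidful's actual code.

```python
def fuse_by_product(am_probs, lm_probs):
    """Multiply two distributions over the same candidates and renormalize."""
    if am_probs.keys() != lm_probs.keys():
        raise ValueError("both models must score the same candidates")
    product = {tok: am_probs[tok] * lm_probs[tok] for tok in am_probs}
    total = sum(product.values())
    if total == 0:
        # Degenerate case: fall back to the acoustic model alone.
        return dict(am_probs)
    return {tok: p / total for tok, p in product.items()}


# Toy numbers (invented): the acoustic model prefers the misspelling,
# the language model strongly prefers the real word.
am = {"tuch": 0.6, "touch": 0.4}
lm = {"tuch": 0.05, "touch": 0.95}

fused = fuse_by_product(am, lm)
best = max(fused, key=fused.get)
print(best, fused)
```

In this toy case the language model outvotes the acoustic model, which is the effect you would hope for on spelling-like errors.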

I also thought about an end-to-end system but that looks very tough to implement because CTC Loss needs to function properly. I think it is a very interesting avenue, and I’d love to explore more. Definitely interested in helping build a library solely for language correction based on pre-trained huggingface models :slight_smile:

It would be really cool if we could fine-tune both models simultaneously. I’m not 100% sure, but I think the decoding used by Wav2Vec2 will break the computational graph, which would make backpropagation difficult.
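For anyone wondering why decoding gets in the way, here is a rough pure-Python sketch of greedy CTC decoding. It is not the Wav2Vec2 implementation, and the vocabulary and frame probabilities are invented. The point is that both the per-frame argmax and the collapse step are discrete operations, so gradients cannot pass through them in any useful way.

```python
def ctc_greedy_decode(frame_probs, vocab, blank="_"):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    # Discrete choice per frame; this is where differentiability is lost.
    best_ids = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    chars = []
    prev = None
    for idx in best_ids:
        # Keep a symbol only if it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and vocab[idx] != blank:
            chars.append(vocab[idx])
        prev = idx
    return "".join(chars)


vocab = ["_", "t", "u", "c", "h"]  # toy vocabulary, "_" is the CTC blank
frames = [
    [0.10, 0.70, 0.10, 0.05, 0.05],  # t
    [0.10, 0.60, 0.20, 0.05, 0.05],  # t again (collapsed)
    [0.80, 0.05, 0.05, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.60, 0.10, 0.10],  # u
    [0.10, 0.05, 0.05, 0.70, 0.10],  # c
    [0.10, 0.05, 0.05, 0.10, 0.70],  # h
]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)
```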

Hence, there can be two stages at which language correction is done:

  1. Before the decoding: This means that the model is trained end to end in some fashion, or we apply the CTC Loss to XLSR’s outputs and then to the next Language Model (some encoder-decoder), which learns to correct them at the same time. Decoding then takes place if it is needed (which it most probably will be). This should be done with a character-level LM.

  2. After the decoding: This will use token/word-level LMs and the predictions from XLSR, in some encoder-decoder fashion.

We can test these cases and see which performs better.


For token classification, however, there is one potential challenge: alignment. For example, you can’t always tell whether a token is fully correct or only partially correct. Additionally, the tokens of a misspelled word can correspond to more or fewer tokens in the correct word.

Example: touch and tuch. Suppose these are tokenized into tou, #ch and t, #u, #ch. How would you classify each token as right or wrong in this case?
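One possible way to get labels despite the length mismatch is to align the two token sequences with an edit-distance style matcher. Below is a small sketch using `difflib.SequenceMatcher` from the Python standard library; the helper name `label_tokens` is hypothetical, and the tokenization is just the one from the example.

```python
from difflib import SequenceMatcher


def label_tokens(hyp_tokens, ref_tokens):
    """Label each hypothesis token with the edit operation covering it."""
    labels = [None] * len(hyp_tokens)
    matcher = SequenceMatcher(a=hyp_tokens, b=ref_tokens, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # tag is "equal", "replace", "delete" or "insert".
        # "insert" spans no hypothesis tokens, so it labels nothing here.
        for i in range(i1, i2):
            labels[i] = tag
    return labels


hyp = ["t", "#u", "#ch"]  # tokens of the ASR output "tuch"
ref = ["tou", "#ch"]      # tokens of the correct word "touch"
labels = label_tokens(hyp, ref)
print(list(zip(hyp, labels)))
```

This marks `t` and `#u` as one replaced span and `#ch` as correct. It does not settle the "partially correct" question, and pure insertions (tokens missing from the ASR output) would need a separate label attached to a neighbouring token.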

That’s actually a pretty good idea!

Are you familiar with shallow/deep fusion?

I haven’t had that problem yet, but maybe it could be solved by tokenizing on spaces.

Not really

My next step would be creating a repo and starting with dataset generation.
The dataset should be generated by the trained ASR model itself, so that the correction model automatically learns the mistakes the transcription makes.
I think it would be pretty cool to provide multiple strategies, so every idea gets covered :slight_smile:
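Roughly, the dataset generation could look like the sketch below. Everything here is a placeholder: `build_correction_pairs` is a hypothetical helper, and `fake_transcribe` stands in for the real ASR model, which would take audio instead of a clip id.

```python
def build_correction_pairs(samples, transcribe):
    """Pair each ASR hypothesis with its reference transcript.

    samples:    iterable of (audio, reference_text) tuples
    transcribe: callable mapping audio to the ASR model's text output
    """
    pairs = []
    for audio, reference in samples:
        hypothesis = transcribe(audio)
        # Keep already-correct pairs too, so the correction model
        # also learns when to leave the text unchanged.
        pairs.append({"input": hypothesis, "target": reference})
    return pairs


def fake_transcribe(audio):
    """Stand-in for a trained ASR model (canned outputs for illustration)."""
    canned = {"clip_1": "dont tuch that", "clip_2": "hello world"}
    return canned[audio]


samples = [("clip_1", "don't touch that"), ("clip_2", "hello world")]
pairs = build_correction_pairs(samples, fake_transcribe)
for pair in pairs:
    print(pair)
```

The resulting input/target pairs are in the shape a seq2seq model expects, so the same data could be reused across the different strategies.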

What do you mean? :thinking:

I’d love to collaborate :slight_smile:.

Are you only thinking of English for now? Most models would be based on English.

We can also look into Character vs Word based models.

I’m familiar with shallow/deep fusion for multi-modal systems. Not sure how that applies here.

No, it should be multilingual


Has anyone had experience with this?
The online demo looks good for our case.
I will start today with dataset generation for seq2seq with T5 and NeuSpell.
I’ll share the repo here later.
Do you want to move the conversation to Slack?


Shallow fusion is a very common technique in ASR. It basically combines an acoustic model such as wav2vec2 with a pretrained language model. You train the language model with the same vocabulary as the acoustic model, and at inference time you combine the outputs of the two models as follows:

am_y = p_AM(y|x)
lm_y = p_LM(y)
y* = argmax_y [ log am_y + λ log lm_y ]

This can improve model performance considerably.
It’s the same idea we’re discussing here.
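Here is a tiny sketch of that scoring rule applied to a list of candidate transcripts. The candidates and their probabilities are invented, and `shallow_fusion_score` and `pick_best` are just illustrative names. In a real system the same combination would typically happen inside beam search rather than on a fixed candidate list.

```python
import math


def shallow_fusion_score(am_prob, lm_prob, lam=0.5):
    """log p_AM + lambda * log p_LM for one candidate."""
    return math.log(am_prob) + lam * math.log(lm_prob)


def pick_best(candidates, lam=0.5):
    """Return the text of the candidate with the highest fused score."""
    best = max(
        candidates,
        key=lambda c: shallow_fusion_score(c["am"], c["lm"], lam),
    )
    return best["text"]


# Invented scores: the acoustic model slightly prefers the misspelling,
# the language model finds it very unlikely.
candidates = [
    {"text": "dont tuch that", "am": 0.5, "lm": 0.001},
    {"text": "don't touch that", "am": 0.3, "lm": 0.05},
]

best_fused = pick_best(candidates, lam=0.5)
best_am_only = pick_best(candidates, lam=0.0)  # lambda = 0 ignores the LM
print(best_fused)
print(best_am_only)
```

The weight λ is a hyperparameter that you would tune on a validation set; with λ = 0 you get the plain acoustic model back.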

This paper is a good starting point:


Please create a slack group :slight_smile:

You can send me an invite @

This means we’ll have to pretrain character-level/word-level LMs/Generative Models.

I think @voidful suggested a similar thing. He takes a product of the probabilities from Wav2Vec2 and GPT-2 after aligning, and then uses decoding.

Not sure what deep fusion would mean here.

Could you provide some code for ?

Many ideas from automatic post-editing and automatic grammar correction can probably be used here as well. Those are some good keywords to get you started.
