Language model for wav2vec2.0 decoding

Hello, I implemented wav2vec2.0 and currently decode without a language model. How can I add one for decoding (say, an n-gram model trained with KenLM), @patrickvonplaten?

Thanks in advance.

Note: I also opened an issue, but was redirected here.


Hey Emre!

Yeah good question - we currently don’t support evaluating with a language model, but we plan on adding this functionality soon! It’s sadly not that trivial to decode a CTC model with a language model. I’ll try to keep you posted for updates here!
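For context on why this is not trivial: without a language model, CTC output is usually decoded greedily, by taking the most likely token per frame, collapsing repeats, and dropping the blank token. Below is a minimal sketch of that baseline, assuming a toy vocabulary where index 0 is the blank. The names `ctc_greedy_decode`, `id_to_char` and `frame_ids` are made up for illustration and are not part of any library.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeats, then drop blanks (the standard CTC rule)."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary: index 0 is the CTC blank.
id_to_char = {0: "", 1: "a", 2: "b", 3: " "}

# Pretend these are the argmax token ids for ten audio frames.
frame_ids = [1, 1, 0, 1, 2, 2, 0, 3, 3, 1]

decoded = ctc_greedy_decode(frame_ids, id_to_char)
print(repr(decoded))
```

Because many different frame alignments collapse to the same text, LM scores cannot simply be added frame by frame; a decoder has to search over text prefixes instead.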


Assuming that one already has a KenLM model, am I wrong to assume that it’s just a matter of passing the wav2vec2 output logits to the ctcdecode decoder, as exemplified here: GitHub - parlance/ctcdecode: PyTorch CTC Decoder bindings?

Or is there more to it than that?
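As a rough illustration of what a CTC beam search decoder with LM scoring does, here is a toy pure-Python prefix beam search. It is a sketch, not ctcdecode's implementation: it works on plain probabilities instead of log-probabilities, assumes single-character labels, and applies the `lm` callback (a hypothetical function returning P(char | prefix)) at every character, whereas word-level n-gram decoders usually score at word boundaries.

```python
from collections import defaultdict


def prefix_beam_search(probs, labels, blank_id=0, beam_width=8, lm=None, alpha=0.5):
    """probs: per-frame probability lists; labels: index -> single character."""
    # For each prefix keep (prob. of ending in blank, prob. of ending in non-blank).
    beams = {"": (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for idx, p in enumerate(frame):
                if idx == blank_id:
                    # A blank keeps the prefix unchanged.
                    next_beams[prefix][0] += (p_b + p_nb) * p
                    continue
                char = labels[idx]
                lm_weight = lm(prefix, char) ** alpha if lm else 1.0
                if prefix and prefix[-1] == char:
                    # Repeated character: it only extends the prefix after a blank,
                    # otherwise it is merged into the same prefix.
                    next_beams[prefix + char][1] += p_b * p * lm_weight
                    next_beams[prefix][1] += p_nb * p
                else:
                    next_beams[prefix + char][1] += (p_b + p_nb) * p * lm_weight
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(ps) for prefix, ps in ranked[:beam_width]}
    return max(beams.items(), key=lambda kv: sum(kv[1]))[0]


labels = ["", "a", "b"]  # index 0 is the blank

# Greedy decoding would pick blank twice here, but "a" has more total probability.
probs = [[0.6, 0.4, 0.0], [0.6, 0.4, 0.0]]
best_path = prefix_beam_search(probs, labels)


def toy_lm(prefix, char):
    """Hypothetical character LM that strongly prefers 'b'."""
    return 0.9 if char == "b" else 0.1


one_frame = [[0.1, 0.5, 0.4]]
without_lm = prefix_beam_search(one_frame, labels)
with_lm = prefix_beam_search(one_frame, labels, lm=toy_lm, alpha=1.0)
print(best_path, without_lm, with_lm)
```

The `alpha` exponent plays the role of an LM weight: at 0 the LM is ignored, and larger values let it override the acoustic model more often.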

@EmreOzkose Good question, I think. I’m not working on this at a professional level, though; I’m currently looking into it. :slightly_smiling_face:

Hi all, I’ve been experimenting with KenLM and wav2vec2; here is the notebook.
I don’t know if this is a proper implementation, but it works!
I also still need to clean up a few things, such as the vocab.


@Wikidepia Can you share how much it improved your WER score? Also, did you try a character-level LM as well?

It improved from 14.2 to 9.2. I haven’t tried a character-level LM :sweat_smile:
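For readers comparing numbers like these, here is a minimal sketch of how WER is commonly computed (word-level edit distance divided by the number of reference words), plus the relative reduction implied by the figures above, assuming they are WER percentages. The function name `wer` and the example sentences are illustrative only.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
example = wer("the cat sat on the mat", "the cat sit on mat")

# Relative WER reduction for the figures quoted above (assumed to be percentages).
relative_reduction = (14.2 - 9.2) / 14.2
print(round(example, 3), round(relative_reduction, 3))
```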


I added support for KenLM using the flashlight library here: Wav2Vec-Wrapper/ at main · Edresson/Wav2Vec-Wrapper · GitHub

It supports using the KenLM binary file instead of the ARPA file, and it is also possible to restrict the model’s vocabulary.


Thank you @Wikidepia and @Edresson. I will check them out.

Hi Patrick!

Any news on language model evaluation support?

See Added Feature: Prefix decoding for wav2vec2 models by deepang17 · Pull Request #11606 · huggingface/transformers · GitHub


Wiki, when I run your code, it predicts only spaces and "-". Is there a known reason for this?

@jolurf you can also use this decoder (GitHub - parlance/ctcdecode: PyTorch CTC Decoder bindings). Take the labels from your tokenizer and create an n-gram language model with KenLM. After that, you can feed the logits from your Wav2Vec2 model into the decoder.
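The "take the labels from your tokenizer" step can be sketched as follows. This assumes a Wav2Vec2-style CTC vocab in which the pad token doubles as the CTC blank and `|` is the word delimiter; the helper name `build_decoder_labels` and the miniature vocab are hypothetical. With a real tokenizer, the dict would come from `tokenizer.get_vocab()`.

```python
def build_decoder_labels(vocab, pad_token="<pad>", word_delimiter="|"):
    """Return labels ordered by logit index, plus the index of the CTC blank."""
    ordered = sorted(vocab.items(), key=lambda kv: kv[1])
    if [index for _, index in ordered] != list(range(len(ordered))):
        raise ValueError("vocab ids must be contiguous and start at 0")
    # The LM is trained on text with spaces, so map the delimiter back to a space.
    labels = [" " if token == word_delimiter else token for token, _ in ordered]
    return labels, vocab[pad_token]


# Hypothetical miniature vocab in the style of a Wav2Vec2 CTC tokenizer.
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4, "e": 5, "t": 6, "a": 7}

labels, blank_id = build_decoder_labels(vocab)
print(labels, blank_id)
```

The label order has to match the logit dimension exactly, and the blank index has to be passed to the decoder; a mismatch in either tends to produce garbage output. As far as I can tell from the ctcdecode README, the decoder expects probabilities (or log-probabilities when `log_probs_input=True`) rather than raw logits, so treat that detail as an assumption to verify before use.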

@patrickvonplaten Are there any updates on the transformer language model?

Voidful suggested combining the wav2vec2 probabilities with those of a GPT-2 model:
( huggingface_notebook/xlsr_gpt.ipynb at main · voidful/huggingface_notebook · GitHub )

However, in that notebook the CTC vocab seems to match the GPT vocab. Unfortunately, this is not the case for English. Is there already a solution?
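One generic way around a vocabulary mismatch is to rescore whole candidate transcripts as strings, instead of mixing per-token probabilities. The sketch below shows that idea with a toy unigram LM; it is a different technique from the linked notebook, and every name and number in it (`rescore`, `toy_lm_logprob`, the scores) is made up for illustration. In practice `lm_logprob` could wrap any LM that can score a sentence.

```python
def rescore(nbest, lm_logprob, alpha=0.5, beta=0.0):
    """Re-rank (text, acoustic_logprob) pairs by a fused score.

    score = acoustic_logprob + alpha * lm_logprob(text) + beta * word_count
    """
    scored = []
    for text, acoustic_logprob in nbest:
        total = acoustic_logprob + alpha * lm_logprob(text) + beta * len(text.split())
        scored.append((total, text))
    return [text for _, text in sorted(scored, reverse=True)]


# Hypothetical unigram log-probabilities standing in for a real LM.
WORD_LOGPROBS = {"ice": -2.0, "cream": -2.0, "i": -1.5, "scream": -6.0}


def toy_lm_logprob(text):
    """Sum of per-word log-probabilities; unknown words get a flat penalty."""
    return sum(WORD_LOGPROBS.get(word, -10.0) for word in text.split())


# Made-up n-best list: the acoustic model slightly prefers the wrong transcript.
nbest = [("i scream", -1.0), ("ice cream", -1.2)]

acoustic_only = rescore(nbest, toy_lm_logprob, alpha=0.0)
fused = rescore(nbest, toy_lm_logprob, alpha=0.5)
print(acoustic_only[0], "->", fused[0])
```

Since the LM only ever sees complete strings here, its tokenization is independent of the CTC vocabulary. The trade-off is that it can only reorder candidates the acoustic model already proposed.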

If this discussion is still ongoing: there is a pull request currently open, Added Feature: Prefix decoding for wav2vec2 models by deepang17 · Pull Request #11606 · huggingface/transformers · GitHub, and, as @ChristophBensch mentions, a means of using KenLM via GitHub - parlance/ctcdecode: PyTorch CTC Decoder bindings.

We have an example of this at GitHub - techiaith/docker-wav2vec2-xlsr-ft-cy: Hyfforddi modelau adnabod lleferydd Cymraeg wav2vec2 a KenLM a'u darparu drwy weinydd gwasanaeth API // Train wav2vec2 and KenLM models for Welsh language speech recognition and/or provide via a simple API server. It has reduced our WER for Welsh from 25% to 15%. Since our scripts use HuggingFace’s OSCAR dataset, they should be easily adaptable to train and optimize LMs for other lesser-resourced languages as well.


Thanks @DewiBrynJones for the implementation, love the idea to have readmes in your local language and reference an English version :slight_smile:

Hi all! As advised by @andersgb1, I used a KenLM n-gram language model on top of a distilled wav2vec2 that I trained, and it improved my WER (26 → 12.6). If you are interested, here’s the notebook (it runs seamlessly on Colab): OthmaneJ/distil-wav2vec2 · Hugging Face


Could you please share the code you used for distilling wav2vec2?

So, to use wav2vec2 with GPT-2 for English, would we just have to match the vocab used in wav2vec2 with the vocab used in GPT-2?

Is integrating an LM for wav2vec2 basically pointless now with the release of HuBERT, which, if I understand correctly, is both an audio and a language model at the same time? facebook/hubert-xlarge-ll60k · Hugging Face

I’m trying to achieve a sub-5% WER (surpassing human performance), but I don’t know whether HuBERT will get there after I fine-tune it on my own data, because I’m not sure about the language model part.

Does it also need an integration with a language model to actually make it perform well?