Language model for wav2vec2.0 decoding

@patrickvonplaten @Beau
Hi guys, I have also implemented simple KenLM beam search decoding for Wav2Vec2CTC using: GitHub - parlance/ctcdecode: PyTorch CTC Decoder bindings

You may find it useful

Here is the repo:
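
Independent of that repo, here is a rough, pure-Python sketch of what CTC prefix beam search with a language-model score looks like. It is not the ctcdecode API: `ctc_beam_search`, `lm_score` and `alpha` are names made up for illustration, and `lm_score` stands in for whatever returns a log-probability for a text (KenLM, for example).

```python
import math
from collections import defaultdict
from typing import Callable, Optional, Sequence

NEG_INF = -math.inf


def log_add(*xs: float) -> float:
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_beam_search(
    log_probs: Sequence[Sequence[float]],   # shape (time, vocab), log-softmax output
    labels: Sequence[str],                  # labels[i] is the symbol for column i
    blank: int = 0,
    beam_width: int = 8,
    lm_score: Optional[Callable[[str], float]] = None,  # hypothetical LM hook
    alpha: float = 0.5,                     # LM weight (assumed name)
) -> str:
    def to_text(prefix):
        return "".join(labels[i] for i in prefix)

    def fused(item):
        # Acoustic score of a prefix, plus the weighted LM score if an LM is given.
        prefix, (p_b, p_nb) = item
        score = log_add(p_b, p_nb)
        if lm_score is not None:
            score += alpha * lm_score(to_text(prefix))
        return score

    # prefix -> (log prob of paths ending in blank, log prob ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for idx, lp in enumerate(frame):
                if idx == blank:
                    b, nb = nxt[prefix]
                    nxt[prefix] = (log_add(b, p_b + lp, p_nb + lp), nb)
                elif prefix and prefix[-1] == idx:
                    # Repeated symbol without a blank in between collapses.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b, log_add(nb, p_nb + lp))
                    # A real repeat is only possible after a blank.
                    ext = prefix + (idx,)
                    b, nb = nxt[ext]
                    nxt[ext] = (b, log_add(nb, p_b + lp))
                else:
                    ext = prefix + (idx,)
                    b, nb = nxt[ext]
                    nxt[ext] = (b, log_add(nb, p_b + lp, p_nb + lp))
        # Keep only the best prefixes according to the fused score.
        beams = dict(sorted(nxt.items(), key=fused, reverse=True)[:beam_width])
    return to_text(max(beams.items(), key=fused)[0])
```

The point of the prefix bookkeeping is that several frame-level paths collapse to the same text, so their probabilities are summed before prefixes are compared; greedy decoding only looks at the single best symbol per frame. Real decoders usually apply the LM at word boundaries rather than on the whole string at every step, which this sketch does not attempt.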


This was very helpful. Thanks for posting it.

Could you please share the code you used for distilling wav2vec2?

Hey guys,

I’ve done some benchmarking with the pyctcdecode library, and I think it actually works quite well in combination with transformers.

Here is a repo where you can find some comparisons between Wav2Vec2 + LM vs. Wav2Vec2 + no LM as well as all the necessary scripts to run the eval: GitHub - patrickvonplaten/Wav2Vec2_PyCTCDecode: Small repo describing how to use Hugging Face's Wav2Vec2 with PyCTCDecode
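
For anyone who wants the gist before opening the repo, a minimal sketch of the wiring could look like this. It assumes pyctcdecode is installed and that you already have the tokenizer vocabulary as a token-to-ID dict; `labels_from_vocab` and `build_lm_decoder` are helper names invented here, not part of either library.

```python
from typing import Dict, List


def labels_from_vocab(vocab: Dict[str, int]) -> List[str]:
    """Order tokens by ID so that labels[i] lines up with logit column i."""
    return [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]


def build_lm_decoder(vocab: Dict[str, int], kenlm_model_path: str):
    """Build a pyctcdecode beam search decoder backed by a KenLM model.

    The import is done lazily so the helper above stays usable
    even where pyctcdecode is not installed.
    """
    from pyctcdecode import build_ctcdecoder

    return build_ctcdecoder(
        labels_from_vocab(vocab),
        kenlm_model_path=kenlm_model_path,
    )
```

With a decoder built this way, `decoder.decode(logits)` takes the model output for one utterance as a NumPy array of shape (time, vocab). The casing of the labels has to match the casing of the text the LM was trained on, so you may need to lower-case one side or the other.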


We now have a detailed blog post explaining step by step how to create an n-gram language model and how to integrate it with Transformers and pyctcdecode here:


@patrickvonplaten - thanks for that. I have a wav2vec2 model and a binary KenLM language model, both of which I built without using Hugging Face. I am interested in porting my model to Hugging Face. Is this currently possible, or not yet?

Sorry for resurrecting this, but it seems like the right place to ask: has anyone tried CTC decoding with models other than KenLM? Is it too slow, and are there any public attempts or examples of that? Sorry if that’s a stupid question. I realize that n-grams are much faster, but in some use cases (like mine) precision is more important than speed, and it seems that models like GPT-Neo could achieve much greater precision than an n-gram.

Patrick, I am preparing to use Wav2Vec2 with the language model you describe here - for my solution I particularly like pyctcdecode’s “hotwords” function. I noticed, however, that KenLM is distributed under the GNU Lesser General Public License (LGPL), which is much less permissive than the other licenses in the chain in terms of commercial use. Do you happen to have any intuitions about whether using .arpa files produced by KenLM with pyctcdecode/Wav2Vec2 forces inheritance of the LGPL? Thanks!

Hi everyone!

I tried to build a Wav2Vec2ProcessorWithLM from a 3-gram language model trained with the kaldi-asr toolkit instead of a KenLM-based LM, but I received the error below:

OSError: Cannot read model ...
(lm/ in void lm::ReadBackoff(util::FilePiece&, lm::Prob&) threw FormatLoadException.  
Non-zero backoff -1.113 provided for an n-gram that should have no backoff in the 3-gram at byte 4082800 Byte: 4082800)

Is it a good idea to use a Kaldi-based LM instead of KenLM? (Both are in .arpa format.) @patrickvonplaten
Thanks for your attention.
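
One possible reading of that error, offered as a guess rather than a confirmed diagnosis: KenLM refuses a non-zero backoff weight on n-grams of the highest order in the file (here the 3-grams), since nothing longer can back off to them, while some other toolkits write one anyway. Under that assumption, a small clean-up pass over the .arpa file might be enough. `strip_top_order_backoffs` is a hypothetical helper, not part of KenLM or Kaldi.

```python
import re
from typing import Iterable, Iterator

_HEADER = re.compile(r"^ngram\s+(\d+)\s*=\s*\d+\s*$")   # e.g. "ngram 3=12345"
_SECTION = re.compile(r"^\\(\d+)-grams:\s*$")           # e.g. "\3-grams:"


def strip_top_order_backoffs(lines: Iterable[str]) -> Iterator[str]:
    """Yield ARPA lines with backoff weights removed from the highest order.

    Assumes whitespace-separated fields: logprob, N words, optional backoff.
    """
    lines = [line.rstrip("\n") for line in lines]
    # The highest order is taken from the "ngram N=count" header lines.
    max_order = max(int(m.group(1)) for m in map(_HEADER.match, lines) if m)
    current = 0
    for line in lines:
        section = _SECTION.match(line)
        if section:
            current = int(section.group(1))
        elif line.startswith("\\"):
            current = 0  # \data\ or \end\
        elif current == max_order and line.strip():
            fields = line.split()
            if len(fields) == max_order + 2:  # logprob + words + backoff
                line = fields[0] + "\t" + " ".join(fields[1:-1])
        yield line
```

To use it, read the original file line by line, write the yielded lines to a new file, and point KenLM at the new file. Lower-order entries are left untouched, so the n-gram counts in the header stay valid.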

Hey carrotpie, hope you’re doing fine.
Have you found a solution to your question here?
Is it possible to wrap a wav2vec2 model with an LM other than KenLM?
Do you have any experience with this?

hey sara
have you found a solution to your question here?

No, not yet. I’m still waiting!

Hey, I have not tried to do it myself, I don’t think I would be skillful enough for that kind of task. The closest thing I found was a feature called “neural rescoring” within NVIDIA’s NeMo framework: GitHub - NVIDIA/NeMo: NeMo: a toolkit for conversational AI
Perhaps someone could hack around with the code from NeMo and port it to Transformers, or at least get inspiration from it.
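
To make the idea concrete, rescoring generally means: let the fast beam search produce an n-best list, then re-rank those few candidates with a slower, stronger LM. Below is a minimal sketch of that re-ranking step only. It is not NeMo’s implementation; `rescore_nbest`, `lm_weight` and `length_bonus` are names chosen for illustration, and `lm_log_prob` is a placeholder for any function that returns a log-probability for a sentence (a GPT-style model, for instance).

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],          # (text, acoustic/beam log score)
    lm_log_prob: Callable[[str], float],     # hypothetical neural LM scorer
    lm_weight: float = 0.5,
    length_bonus: float = 0.0,               # per-word bonus to offset LM length bias
) -> List[Tuple[str, float]]:
    """Return the candidates re-ranked by combined score, best first."""
    rescored = []
    for text, first_pass_score in nbest:
        n_words = len(text.split())
        score = first_pass_score + lm_weight * lm_log_prob(text) + length_bonus * n_words
        rescored.append((text, score))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

Because the expensive model only sees a handful of hypotheses per utterance, the slowdown is bounded by the size of the n-best list. The catch is that rescoring cannot recover a transcript that the first pass never proposed.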