How to create Wav2Vec2 with a language model

Now that language-model-boosted decoding is possible for Wav2Vec2 (see https://twitter.com/PatrickPlaten/status/1468999507488788480 and patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm · Hugging Face), it’s important to know how one can create a Wav2Vec2 + LM repo.

Let’s go through it step by step (hopefully this will be simpler in the future):

  1. Install kenlm:
    In my opinion, the best guide for building kenlm is this one: kenlm/BUILDING at master · kpu/kenlm · GitHub
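    For reference, here is a minimal sketch of the standard CMake build, run from Python to match the script below; it assumes the build dependencies from the BUILDING guide (e.g. Boost and zlib) are already installed:

import os

# clone kenlm and build the binaries into kenlm/build/bin
os.system("git clone https://github.com/kpu/kenlm.git")
os.system("mkdir -p kenlm/build")
os.system("cd kenlm/build && cmake .. && make -j 4")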

  2. Create an ngram model:
    This is explained quite well here: GitHub - kpu/kenlm: KenLM: Faster and Smaller Language Model Queries
    I wrote a short Python script that quickly creates an n-gram from the Multilingual LibriSpeech text corpus:

#!/usr/bin/env python3
from datasets import load_dataset
import os
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--language", default="polish", type=str, required=True, help="Language to run comparison on. Choose one of 'polish', 'portuguese', 'spanish' or add more to this script."
)
parser.add_argument(
    "--path_to_ngram", type=str, required=True, help="Path to kenLM ngram"
)
args = parser.parse_args()

# load the transcripts of the chosen Multilingual LibriSpeech training split
ds = load_dataset("multilingual_librispeech", f"{args.language}", split="train")

# dump all transcripts into a single text file for kenLM
with open("text.txt", "w") as f:
    f.write(" ".join(ds["text"]))

# build a 5-gram ARPA language model with kenLM's lmplz
os.system(f"./kenlm/build/bin/lmplz -o 5 <text.txt > {args.path_to_ngram}")

## VERY IMPORTANT!!!:
# After the language model is created, one should open the file and add a `</s>` token (see below).
# The file should have a structure which looks more or less as follows:

# \data\
# ngram 1=86586
# ngram 2=546387
# ngram 3=796581
# ngram 4=843999
# ngram 5=850874

# \1-grams:
# -5.7532206      <unk>   0
# 0       <s>     -0.06677356
# -3.4645514      drugi   -0.2088903
# ...

# Now it is very important to also add a </s> token to the n-gram
# so that it can be correctly loaded. You can simply copy the line:

# 0       <s>     -0.06677356

# and change <s> to </s>. When doing this you should also increase the `ngram 1=...` count by 1.
# The new ngram should look as follows:

# \data\
# ngram 1=86587
# ngram 2=546387
# ngram 3=796581
# ngram 4=843999
# ngram 5=850874

# \1-grams:
# -5.7532206      <unk>   0
# 0       <s>     -0.06677356
# 0       </s>     -0.06677356
# -3.4645514      drugi   -0.2088903
# ...

# Now the ngram can be correctly used with `pyctcdecode`

See: Wav2Vec2_PyCTCDecode/create_ngram.py at main · patrickvonplaten/Wav2Vec2_PyCTCDecode · GitHub
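If you prefer not to edit the ARPA file by hand, the same `</s>` fix can be scripted. Here is a minimal sketch (the input/output file names are placeholders), assuming the standard ARPA layout shown above:

with open("5gram.arpa", "r") as read_file, open("5gram_correct.arpa", "w") as write_file:
    has_added_eos = False
    for line in read_file:
        if not has_added_eos and "ngram 1=" in line:
            # bump the unigram count by one to account for the new </s> entry
            count = line.strip().split("=")[-1]
            write_file.write(line.replace(f"{count}", f"{int(count)+1}"))
        elif not has_added_eos and "<s>" in line:
            # keep the <s> line and duplicate it as </s>
            write_file.write(line)
            write_file.write(line.replace("<s>", "</s>"))
            has_added_eos = True
        else:
            write_file.write(line)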

Feel free to copy the script above. Multilingual LibriSpeech is already a very clean text corpus; you might want to pre-process other text corpora to remove punctuation etc., for example with a small cleaning helper like the sketch below.
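This helper is purely illustrative and not part of the original script (the regex, lower-casing, and file names are just one reasonable choice; adapt them to your language and corpus):

import re

def normalize_text(text: str) -> str:
    # lower-case and drop punctuation (keeps word characters and whitespace)
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    # collapse repeated whitespace
    return re.sub(r"\s+", " ", text).strip()

# "raw_corpus.txt" is a placeholder for your own text corpus
with open("raw_corpus.txt") as f_in, open("text.txt", "w") as f_out:
    f_out.write(" ".join(normalize_text(line) for line in f_in))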

As an example, this step created the Spanish 5-gram here: kensho/5gram-spanish-kenLM · Hugging Face

  3. Now we should load the language model into a pyctcdecode BeamSearchDecoderCTC, as this is the format we need. Here one should be very careful to choose exactly the same vocabulary as the Wav2Vec2 tokenizer’s vocab.

First, we should pick a fine-tuned Wav2Vec2 model that we would like to add a language model to.
Let’s choose: jonatasgrosman/wav2vec2-large-xlsr-53-spanish · Hugging Face

Now we instantiate a BeamSearchDecoderCTC and save it to a folder called wav2vec2_with_lm.

E.g. you can run this code:

from transformers import AutoTokenizer
from pyctcdecode import build_ctcdecoder

tokenizer = AutoTokenizer.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-spanish")

# sort the vocab by token id so that position i matches the model's logit index i
vocab_dict = tokenizer.get_vocab()
sorted_dict = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}

# build the decoder from the sorted labels and the kenLM n-gram created above
decoder = build_ctcdecoder(
    list(sorted_dict.keys()),
    "path/to/your/ngram.arpa",  # the `--path_to_ngram` file created in step 2
)

decoder.save_to_dir("wav2vec2_with_lm")

Now we should have saved the following files in wav2vec2_with_lm:

- language_model
  - attrs.json
  - kenLM.arpa
  - unigrams.txt
- alphabet.json

That’s it! Now all you need to do is upload these files to your Wav2Vec2 model repo so that the directory structure looks as follows:
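Roughly like this (treat it as a sketch; exact file names can differ per checkpoint):

- alphabet.json
- language_model
  - attrs.json
  - kenLM.arpa
  - unigrams.txt
- config.json
- preprocessor_config.json
- pytorch_model.bin
- special_tokens_map.json
- tokenizer_config.json
- vocab.json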

Now, all you need to change is your decoding to:
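For example, a minimal sketch (the repo id and the `audio` array are placeholders, and `Wav2Vec2ProcessorWithLM` needs a recent version of transformers):

import torch
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

# placeholder repo id: the model you uploaded above
model_id = "your-username/wav2vec2-large-xlsr-53-spanish-with-lm"

processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id)

# `audio` is a 1-D float array sampled at 16 kHz
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# batch_decode runs the beam search with the kenLM n-gram instead of plain argmax decoding
transcription = processor.batch_decode(logits.numpy()).text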


This is cool!
Can you make the inference widget use the language model?

This is awesome!
I have a bunch of KenLM models for different languages trained on SentencePiece-tokenized Wikipedia and OSCAR here, if you find them useful: edugp/kenlm at main


Hi @patrickvonplaten

Were you able to install kenlm in Spaces?

This is awesome. Thank you!

@Harveenchadha you can install it by adding this line to your requirements.txt:
https://github.com/kpu/kenlm/archive/master.zip


Yes, working on it! :slight_smile:

If someone is trying this with an LM like me and has problems with AutoProcessor / AutoModelForCTC,
install transformers from @patrickvonplaten’s commit: [AutoProcessor] Correct AutoProcessor and automatically add processor… · huggingface/transformers@a139288 · GitHub

At least I was able to load my model (RASMUS/wav2vec2-xlsr · Hugging Face) with the LM after that fix.

Hi,
Thank you for sharing these resources!

I have come across the Wav2Vec2 paper, and according to it, using a Transformer as a language model yields the best results in terms of WER.
However, I cannot find any other reference/resource on the subject, nor did I find the weights of the Transformer used for decoding.

Do you happen to have any resources on the subject (i.e., where I could find the weights of such a Transformer decoder)?

Thanks a lot for helping the community :hugs:

While predicting text with the model with LM, I am getting the error below.

Hey @yaswanth,

Sorry to reply so late. Could you add a fully reproducible code sample that is as minimal as possible? :slight_smile:

[The code, model files, and error message were posted as screenshots and are not reproduced here.]

Hi @patrickvonplaten @philschmid

Is it possible to deploy Wav2Vec2 with a KenLM language model on Amazon SageMaker? I was following this article https://www.philschmid.de/automatic-speech-recognition-sagemaker but I couldn’t find an option in the SageMaker HF library to use a language model.


Hey @diegoseto,

Yes, it is possible by deploying your model from Amazon S3 with a custom inference.py that includes the inference code.
Here is an example notebook for deploying a sentence-transformers model. You just need to modify the inference.py and how the model.tar.gz is created. A hypothetical sketch of such an inference.py is shown below.
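For illustration only, a minimal sketch of what such an inference.py could look like (the handler names follow the SageMaker Hugging Face inference toolkit conventions; treat everything here as an assumption to adapt, not a tested implementation):

import torch
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM


def model_fn(model_dir):
    # model_dir is where SageMaker extracts model.tar.gz (it must contain the language_model folder)
    processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_dir)
    model = AutoModelForCTC.from_pretrained(model_dir)
    return model, processor


def predict_fn(data, model_and_processor):
    model, processor = model_and_processor
    # expects a 1-D float array of 16 kHz audio under the "inputs" key
    inputs = processor(data["inputs"], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return {"text": processor.batch_decode(logits.numpy()).text}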


Cool, thank you, I will try this. Thanks for the quick reply too :smiley:


Hi everyone - I’m pretty sure I know the answer to this, but just making sure: has anyone managed to use the processor with a kenlm language model on Windows? I created my .arpa file on a Linux OS, hoping to be able to use it with wav2vec2 on Windows without actually building/installing kenlm, but this does not appear to be the case. Even when I create the processor, bundled with the LM, on Linux and then load the processor via from_pretrained in my Windows script (based on the blog), it still wants to call kenlm and crashes. I just want to confirm that there is no way around this and that, at this time, Linux is mandatory for using wav2vec2 with a language model, even with an already existing .arpa file created somewhere else.
thanks!
Jonathan