Now that language-model-boosted decoding is possible for Wav2Vec2 (see https://twitter.com/PatrickPlaten/status/1468999507488788480 and patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm · Hugging Face), it is important to know how one can create a Wav2Vec2 + LM repo.
Let me explain it step by step (hopefully this will become simpler in the future):
- Install kenlm:
  The best guide for building kenlm is IMO the official one: kenlm/BUILDING at master · kpu/kenlm · GitHub
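  For reference, here is a minimal sketch of that build, done from Python to stay consistent with the scripts below. The shell commands mirror kenlm's BUILDING file and assume that git, cmake, and kenlm's dependencies (Boost, zlib, etc.) are already installed:

```python
import os

# Clone kenlm and build it, so that ./kenlm/build/bin/lmplz exists afterwards.
# The commands mirror kenlm's BUILDING file.
os.system("git clone https://github.com/kpu/kenlm.git")
os.system("mkdir -p kenlm/build")
os.system("cd kenlm/build && cmake .. && make -j 4")
```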
- Create an n-gram model:
  This is explained quite well in the kenlm README: GitHub - kpu/kenlm: KenLM: Faster and Smaller Language Model Queries
  I wrote a short Python script that lets you quickly create an n-gram from the text corpus of Multilingual LibriSpeech:
```python
#!/usr/bin/env python3
from datasets import load_dataset
import os
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--language",
    default="polish",
    type=str,
    required=True,
    help="Language to run comparison on. Choose one of 'polish', 'portuguese', 'spanish' or add more to this script.",
)
parser.add_argument(
    "--path_to_ngram", type=str, required=True, help="Path to kenLM ngram"
)
args = parser.parse_args()

# dump all training transcriptions into a single text file
ds = load_dataset("multilingual_librispeech", f"{args.language}", split="train")

with open("text.txt", "w") as f:
    f.write(" ".join(ds["text"]))

# build a 5-gram language model with kenlm's lmplz
os.system(f"./kenlm/build/bin/lmplz -o 5 <text.txt > {args.path_to_ngram}")
## VERY IMPORTANT!!!:
# After the language model has been created, one should open the file and add a `</s>` token.
# The file should have a structure which looks more or less as follows:
# \data\
# ngram 1=86586
# ngram 2=546387
# ngram 3=796581
# ngram 4=843999
# ngram 5=850874
# \1-grams:
# -5.7532206 <unk> 0
# 0 <s> -0.06677356
# -3.4645514 drugi -0.2088903
# ...
# It is very important to also add a </s> token to the n-gram
# so that it can be loaded correctly. You can simply copy the line:
# 0 <s> -0.06677356
# and change <s> to </s>. When doing this you should also increase the `ngram 1` count by 1.
# The new n-gram should look as follows:
# The new ngram should look as follows:
# \data\
# ngram 1=86587
# ngram 2=546387
# ngram 3=796581
# ngram 4=843999
# ngram 5=850874
# \1-grams:
# -5.7532206 <unk> 0
# 0 <s> -0.06677356
# 0 </s> -0.06677356
# -3.4645514 drugi -0.2088903
# ...
# Now the n-gram can be used correctly with `pyctcdecode`.
```
See: Wav2Vec2_PyCTCDecode/create_ngram.py at main · patrickvonplaten/Wav2Vec2_PyCTCDecode · GitHub
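If you would rather not edit the ARPA file by hand, here is a minimal sketch that applies the same `</s>` fix programmatically (the file names 5gram.arpa and 5gram_correct.arpa are just placeholders):

```python
# Copy the ARPA file, duplicating the <s> unigram as </s> and bumping the 1-gram count.
with open("5gram.arpa", "r") as read_file, open("5gram_correct.arpa", "w") as write_file:
    has_added_eos = False
    for line in read_file:
        if not has_added_eos and "ngram 1=" in line:
            count = line.strip().split("=")[-1]
            # one more unigram because of the new </s> entry
            write_file.write(line.replace(f"{count}", f"{int(count) + 1}"))
        elif not has_added_eos and "<s>" in line:
            write_file.write(line)
            # copy the <s> line and change the token to </s>
            write_file.write(line.replace("<s>", "</s>"))
            has_added_eos = True
        else:
            write_file.write(line)
```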
Feel free to copy those lines of code. Multilingual LibriSpeech is already a very clean text corpus; for other text corpora you might want to pre-process the data first, e.g. to remove punctuation.
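A minimal sketch of such a cleanup (raw_corpus.txt is a placeholder, and the normalization is deliberately crude — adapt it to your language):

```python
import re

def normalize(text: str) -> str:
    # drop everything that is neither a word character nor whitespace, then lowercase
    return re.sub(r"[^\w\s]", "", text).lower()

# write the cleaned corpus to the text.txt file consumed by lmplz above
with open("raw_corpus.txt", "r") as f_in, open("text.txt", "w") as f_out:
    for line in f_in:
        f_out.write(normalize(line))
```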
As an example, this step created the Spanish 5-gram here: kensho/5gram-spanish-kenLM · Hugging Face
- Now we should load the language model into a pyctcdecode BeamSearchDecoderCTC, as this is the format we need. Here one should be very careful to use exactly the same vocabulary as the Wav2Vec2 tokenizer's vocab.
First, we pick a fine-tuned Wav2Vec2 model that we would like to add a language model to. Let's choose: jonatasgrosman/wav2vec2-large-xlsr-53-spanish · Hugging Face
Now we instantiate a BeamSearchDecoderCTC and save it to a folder `wav2vec2_with_lm`. E.g. you can run this code:
```python
from transformers import AutoTokenizer
from pyctcdecode import build_ctcdecoder

tokenizer = AutoTokenizer.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-spanish")

# the labels passed to pyctcdecode must be sorted by token id
# so that they line up with the model's output logits
vocab_dict = tokenizer.get_vocab()
sorted_dict = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}

decoder = build_ctcdecoder(
    list(sorted_dict.keys()),
    "path/to/5gram_correct.arpa",  # the kenLM n-gram created above
)
decoder.save_to_dir("wav2vec2_with_lm")
```
Now we should have saved the following files in `wav2vec2_with_lm`:
- language_model/
  - attrs.json
  - kenLM.arpa
  - unigrams.txt
- alphabet.json
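As a quick sanity check, the saved folder should round-trip (assuming pyctcdecode's BeamSearchDecoderCTC.load_from_dir):

```python
from pyctcdecode import BeamSearchDecoderCTC

# reload the decoder from the folder we just saved
decoder = BeamSearchDecoderCTC.load_from_dir("wav2vec2_with_lm")
```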
That’s it! Now all you need to do is upload these files to the repo of your Wav2Vec2 model, so that the usual model files (config.json, pytorch_model.bin, preprocessor_config.json, tokenizer files, ...) sit next to alphabet.json and the language_model folder shown above.
Now, all you need to change is your decoding step, e.g.:
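Here is a minimal sketch of LM-boosted inference with the decoder built above (example.wav is a placeholder for any audio file; librosa is used here just to load and resample it to 16 kHz):

```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from pyctcdecode import BeamSearchDecoderCTC

model_id = "jonatasgrosman/wav2vec2-large-xlsr-53-spanish"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
decoder = BeamSearchDecoderCTC.load_from_dir("wav2vec2_with_lm")

# load (and resample) the audio to 16 kHz mono
speech, _ = librosa.load("example.wav", sr=16_000)

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# pyctcdecode consumes the frame-level log-probabilities of a single utterance
log_probs = torch.log_softmax(logits, dim=-1)[0].cpu().numpy()
transcription = decoder.decode(log_probs)
print(transcription)
```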