How to train KenLM on AWS SageMaker?

pierreguillou · May 27, 2022, 8:40pm

Hi,

In @patrickvonplaten's blog post "Boosting Wav2Vec2 with n-grams in Transformers", the n-gram language model KenLM is installed with the following code:

# Let's start by installing the Ubuntu library prerequisites:
sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

# before downloading and unpacking the KenLM repo.
wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

# KenLM is written in C++, so we'll make use of cmake to build the binaries.
mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
# In a terminal the cd above persists, so the binaries are listed relative to kenlm/build.
ls bin

However, this code targets the Ubuntu Linux distribution. It does not run on the Amazon Linux distribution of AWS SageMaker instances.

Does anyone know how to adapt this code?

And what about using KenLM for inference on AWS SageMaker, that is, deploying a Wav2Vec2 model with KenLM inside an inference script for real-time inference?

Note: I read the thread "Serveless memory problem when deploy Wav2Vec2 with custom inference code" and @philschmid's post about how to install KenLM on AWS SageMaker, but as noted above, these instructions are for Ubuntu, not for Amazon Linux, so they do not work on AWS SageMaker.

pierreguillou · May 31, 2022, 6:47pm

Hello.

This issue on the KenLM GitHub repository gives a link to the Boost installation instructions:

Instructions for installing Boost in a home directory: the "Dependencies" page of the KenLM code documentation on Kenneth Heafield's site.

Then, just add sudo to the following commands in an AWS SageMaker terminal and Boost will install successfully:

wget https://dl.bintray.com/boostorg/release/1.72.0/source/boost_1_72_0.tar.bz2
tar xjf boost_1_72_0.tar.bz2

cd boost_1_72_0

sudo ./bootstrap.sh

# Note that this may fail. A common cause is incorrectly installed gzip
# or bzip2. KenLM does not use the iostreams library, so it might work
# anyway. But an incomplete install will break other packages like Moses.

sudo ./b2 --prefix=$PREFIX --libdir=$LIBDIR --layout=tagged link=static,shared threading=multi,single install -j4 || echo FAILURE
cd ..
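
The b2 command above reads $PREFIX and $LIBDIR, which are not set anywhere in this thread. A minimal sketch of defining them before running b2 follows; the values shown ($HOME/usr and its lib subdirectory) are assumptions for illustration, not the values actually used for this solution. With a prefix inside the home directory, sudo may not be needed at all.

```shell
# Hypothetical install locations: the thread does not say which values were used.
# Existing values of PREFIX and LIBDIR are kept if they are already set.
PREFIX="${PREFIX:-$HOME/usr}"
LIBDIR="${LIBDIR:-$PREFIX/lib}"
export PREFIX LIBDIR

echo "Boost prefix: $PREFIX"
echo "Boost libdir: $LIBDIR"

# If CMake later fails to locate this Boost, pointing it at the prefix may help:
#   cmake .. -DBOOST_ROOT="$PREFIX"
```

Note that sudo usually does not pass the caller's environment through, so when keeping sudo it may be necessary to run it as `sudo -E` or to write the paths out literally in the b2 command.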

After that, just follow the two remaining instructions:

# before downloading and unpacking the KenLM repo.
wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

# KenLM is written in C++, so we'll make use of cmake to build the binaries.
mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
# In a terminal the cd above persists, so the binaries are listed relative to kenlm/build.
ls bin

Et voilà! You can now train a KenLM model on AWS SageMaker. (Thanks to Egberto Caetano Araujo da Silva, who found this solution!)
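
For completeness, here is a sketch of the training step itself once the binaries are built. It wraps KenLM's lmplz tool, which reads a text corpus on standard input and writes an ARPA file on standard output. The function name train_kenlm and the KENLM_BIN variable are hypothetical helpers introduced here, and the file names in the usage comment are placeholders.

```shell
# Train an n-gram language model with KenLM's lmplz.
# Usage: train_kenlm ORDER CORPUS OUTPUT_ARPA
# KENLM_BIN (hypothetical) points at the directory holding the built binaries;
# the default is relative to the directory that contains the kenlm checkout.
train_kenlm() {
  order="$1"
  corpus="$2"
  out="$3"
  lmplz="${KENLM_BIN:-kenlm/build/bin}/lmplz"

  if [ ! -x "$lmplz" ]; then
    echo "lmplz not found at $lmplz; build KenLM first" >&2
    return 1
  fi
  if [ ! -r "$corpus" ]; then
    echo "cannot read corpus file: $corpus" >&2
    return 2
  fi

  # -o sets the n-gram order; the corpus goes in on stdin, the ARPA model comes out on stdout.
  "$lmplz" -o "$order" < "$corpus" > "$out"
}

# Example (placeholder file names):
#   train_kenlm 5 text.txt 5gram.arpa
```

The early checks only make a missing build or a wrong path fail with a readable message instead of a shell error.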

philschmid · June 1, 2022, 8:02am

FYI, the SageMaker DLCs, for both training and inference, are Ubuntu-based. You can find the Dockerfiles here: deep-learning-containers/huggingface at master · aws/deep-learning-containers · GitHub
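
Since the distribution differs between environments, a quick check can tell which install instructions apply. The sketch below reads the ID field of /etc/os-release and maps it to a package manager; the function names detect_distro and pkg_manager_for are hypothetical, and the mapping covers only a few common IDs.

```shell
# Print the distribution ID from an os-release file (default: /etc/os-release).
detect_distro() {
  os_release="${1:-/etc/os-release}"
  if [ -r "$os_release" ]; then
    # os-release is a shell-compatible KEY=value file; source it in a subshell
    # so its variables do not leak into the current shell.
    ( . "$os_release" && printf '%s\n' "${ID:-unknown}" )
  else
    printf 'unknown\n'
  fi
}

# Map a distribution ID to its usual package manager (partial list).
pkg_manager_for() {
  case "$1" in
    ubuntu|debian) echo "apt" ;;
    amzn|rhel|centos|fedora) echo "yum" ;;
    *) echo "unknown" ;;
  esac
}

distro="$(detect_distro)"
echo "distribution: $distro, package manager: $(pkg_manager_for "$distro")"
```

If this reports ubuntu, the apt command from the blog post should apply as written; if it reports amzn, the apt step will not work and the Boost-from-source route above is one alternative.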

sid-lekh · February 11, 2024, 12:13pm

Building KenLM requires cmake, which is not available. When I try to install cmake, libssl-dev is missing, so I cannot make any progress.
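
One possible workaround, offered as an untested suggestion for this environment: the cmake package on PyPI ships prebuilt binaries, so installing it with pip avoids compiling cmake from source. The sketch below only checks whether cmake is on the PATH and prints that suggestion if it is not; have_cmd is a hypothetical helper name.

```shell
# Return success if the given command is available on the PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd cmake; then
  # Show which cmake would be used for the KenLM build.
  cmake --version | head -n 1
else
  echo "cmake not found; one option is: python3 -m pip install --user cmake"
  echo "(pip --user installs into ~/.local/bin, which must be on the PATH)"
fi
```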
