How to train KenLM on AWS SageMaker?

Hi,

In @patrickvonplaten's blog post Boosting Wav2Vec2 with n-grams in 🤗 Transformers, the n-gram language model KenLM is installed with the following code:

# Let's start by installing the Ubuntu library prerequisites:
sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

# before downloading and unpacking the KenLM repo.
wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

# KenLM is written in C++, so we'll make use of cmake to build the binaries.
mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
# In a terminal, the previous command already changed into kenlm/build, so list bin from there.
ls bin

However, this code targets the Ubuntu Linux distribution. It does not run on the Amazon Linux distribution used by AWS SageMaker instances.
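Since the whole question turns on which distribution the instance runs, it can help to check before picking an install path. The following is a minimal sketch, not part of the blog post: the function names `distro_id` and `pkg_manager_for` are made up for illustration, and the mapping assumes that Amazon Linux reports `ID="amzn"` in `/etc/os-release` and ships `yum`.

```shell
#!/usr/bin/env bash

# Read the ID field from an os-release file (defaults to /etc/os-release).
# Prints "unknown" if the file is missing or has no ID line.
distro_id() {
  local file="${1:-/etc/os-release}"
  local id
  id="$(sed -n 's/^ID=//p' "$file" 2>/dev/null | tr -d '"')"
  echo "${id:-unknown}"
}

# Map a distribution ID to the package manager it normally uses.
# Assumption: this short list is illustrative, not exhaustive.
pkg_manager_for() {
  case "$1" in
    ubuntu|debian)    echo "apt" ;;
    amzn|rhel|centos) echo "yum" ;;
    *)                echo "unknown" ;;
  esac
}

echo "Distribution: $(distro_id), package manager: $(pkg_manager_for "$(distro_id)")"
```

If this reports `apt`, the `apt install` line from the blog post should apply as written; if it reports `yum`, the Ubuntu package names will not match and the dependencies have to be installed another way.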

Does anyone know how to adapt this code?

And what about the KenLM code for inference on AWS SageMaker, that is, deploying a Wav2Vec2 model with KenLM inside an inference script for real-time inference?

Note: I read the thread Serveless memory problem when deploy Wav2Vec2 with custom inference code and @philschmid's post about how to install KenLM on AWS SageMaker but, as I said, those instructions are for an Ubuntu distribution, not for Amazon Linux, so they do not work on AWS SageMaker.

Hello.

This issue on the KenLM GitHub repository gives a link to the Boost installation instructions:

Here are the instructions for installing Boost in a home directory: dependencies · kenlm · code · Kenneth Heafield

Then just add sudo to the following commands in an AWS SageMaker terminal, and Boost will install successfully:

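# Note: Bintray has since been shut down, so this URL may no longer resolve.
# If it fails, download boost_1_72_0.tar.bz2 from the official Boost site instead.
wget https://dl.bintray.com/boostorg/release/1.72.0/source/boost_1_72_0.tar.bz2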
tar xjf boost_1_72_0.tar.bz2

cd boost_1_72_0

sudo ./bootstrap.sh

# Note that this may fail. A common cause is incorrectly installed gzip
# or bzip2. KenLM does not use the iostreams library, so it might work
# anyway. But an incomplete install will break other packages like Moses.

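# $PREFIX and $LIBDIR are not defined by the commands shown here;
# set them to the desired install locations before running this step.
sudo ./b2 --prefix=$PREFIX --libdir=$LIBDIR --layout=tagged link=static,shared threading=multi,single install -j4 || echo FAILURE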
cd ..
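The `b2` command above reads `$PREFIX` and `$LIBDIR`, which the commands shown here never set. A minimal sketch of defining them first, assuming a per-user install under `$HOME/usr` (that path and the `lib` subdirectory are illustrative choices, not values from the post):

```shell
#!/usr/bin/env bash

# Assumed install locations (hypothetical; adjust to taste).
# Existing values are kept if the variables are already set.
export PREFIX="${PREFIX:-$HOME/usr}"
export LIBDIR="${LIBDIR:-$PREFIX/lib}"

# Abort with a clear message rather than passing empty paths to b2.
: "${PREFIX:?PREFIX must not be empty}"
: "${LIBDIR:?LIBDIR must not be empty}"

echo "Boost will be installed to $PREFIX (libraries in $LIBDIR)"
```

Run this in the same terminal session before `./b2`. With a prefix inside the home directory, `sudo` should not be required for the install step. When building KenLM afterwards, CMake may need a hint such as `-DBOOST_ROOT="$PREFIX"` to find a Boost installed outside the system paths.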

After that, just follow the two remaining steps:

# Download and unpack the KenLM repo.
wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

# KenLM is written in C++, so we'll make use of cmake to build the binaries.
mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
# In a terminal, the previous command already changed into kenlm/build, so list bin from there.
ls bin

Et voilà! You can now train a KenLM model on AWS SageMaker. (Thanks to Egberto Caetano Araujo da Silva, who found this solution!)
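Once the build has produced the binaries, training comes down to piping a text corpus through `lmplz`, as the blog post does with `-o 5`. Here is a minimal sketch of that step with a couple of sanity checks; the wrapper function `train_ngram` and the file names are hypothetical, while `lmplz` and its `-o` (n-gram order) option come from KenLM itself.

```shell
#!/usr/bin/env bash

# Train an n-gram language model with KenLM's lmplz.
# Arguments: <kenlm bin dir> <n-gram order> <input text file> <output .arpa file>
train_ngram() {
  local bin_dir="$1" order="$2" corpus="$3" arpa="$4"

  # Fail early if the build did not produce the binary.
  if [ ! -x "$bin_dir/lmplz" ]; then
    echo "lmplz not found in $bin_dir; did the build succeed?" >&2
    return 1
  fi
  if [ ! -r "$corpus" ]; then
    echo "cannot read corpus file: $corpus" >&2
    return 1
  fi

  # lmplz reads the corpus on stdin and writes the ARPA model to stdout.
  "$bin_dir/lmplz" -o "$order" < "$corpus" > "$arpa"
}

# Hypothetical usage, with text.txt holding one normalized sentence per line:
# train_ngram kenlm/build/bin 5 text.txt 5gram.arpa
```

The resulting `.arpa` file is plain text; KenLM's `build_binary` tool in the same `bin` directory can convert it to a more compact binary format if needed.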


FYI, the SageMaker DLCs (Deep Learning Containers), for both training and inference, are Ubuntu-based. You can find the Dockerfiles here: deep-learning-containers/huggingface at master · aws/deep-learning-containers · GitHub
