Serverless memory problem when deploying Wav2Vec2 with custom inference code


I’m facing issues deploying Wav2Vec2 on Amazon SageMaker with the serverless option, but only when I use a custom inference script (passing the path of the model.tar.gz stored in an Amazon S3 bucket). I’m receiving the following memory error:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: “Inference failed due to insufficient memory on the Endpoint. Please add more memory to the endpoint.”

Serverless inference config code (works fine when using the model hosted on the HF Hub):

from sagemaker.serverless import ServerlessInferenceConfig
serverless_config = ServerlessInferenceConfig(memory_size_in_mb=4096, max_concurrency=10)

Custom inference script code:

import json
from transformers import pipeline


def model_fn(model_dir):
    # Load the ASR pipeline from the unpacked model.tar.gz directory.
    pipe = pipeline('automatic-speech-recognition', model_dir, chunk_length_s=10)
    return pipe


def input_fn(json_request_data, content_type='application/json'):
    # Deserialize the JSON request body.
    input_data = json.loads(json_request_data)
    return input_data


def predict_fn(input_data, pipe):
    # Run the pipeline on the deserialized input.
    result = pipe(input_data)
    return result


def output_fn(transcript, accept='application/json'):
    # Serialize the prediction and return it with the response content type.
    return json.dumps(transcript), accept

Can anyone help me?

Could you try a higher memory size configuration, e.g. 6144 MB? Which model are you testing?
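For reference, serverless endpoints accept only a fixed set of memory sizes. The sketch below assumes the 1 GB steps from 1024 MB to 6144 MB that applied around the time of this thread; the helper name `pick_memory_size` and the 4500 MB estimate are hypothetical and only illustrate rounding a requirement up to a valid value.

```python
# Memory sizes accepted for serverless endpoints (assumption: 1 GB steps
# from 1024 MB to 6144 MB, as documented around the time of this thread).
VALID_MEMORY_SIZES_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def pick_memory_size(required_mb):
    """Return the smallest valid memory size that covers required_mb."""
    for size in VALID_MEMORY_SIZES_MB:
        if size >= required_mb:
            return size
    raise ValueError(
        f"{required_mb} MB exceeds the largest serverless size "
        f"({VALID_MEMORY_SIZES_MB[-1]} MB)"
    )


# Hypothetical estimate: model weights plus headroom for the runtime.
print(pick_memory_size(4500))
```

The returned value could then be passed as memory_size_in_mb to ServerlessInferenceConfig.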

Hi @philschmid

I already tried this configuration, but I got the same error. I’m testing a wav2vec2-large-xlsr model (private).

@diegoseto is there a particular reason why you are creating a script? You can directly provide your HF_API_TOKEN in the hub configuration next to your model id and task. See HF_API_TOKEN

@philschmid yes, the reason is that I want to use a language model and configure the “chunk_length_s” pipeline parameter, but there is no option for this in the Amazon SageMaker library without creating a custom inference script (at least I didn’t find one).

Hi Diego

When using a custom inference script you are leveraging the SageMaker Hugging Face Inference Toolkit. Now the cool part is that this toolkit actually uses the pipelines API in the background, see here.

What that means for you is that you actually don’t have to write an inference script. Instead you can provide additional parameters when calling the endpoint, like so (this is an example for text generation, but the same principle applies in your case):

prompt = st.text_area("Enter your prompt here:")

params = {"return_full_text": True,
          "temperature": temp,
          "min_length": 50,
          "max_length": 100,
          "do_sample": True,
          "repetition_penalty": rep_penalty,
          "top_k": 20}

payload = {"inputs": prompt, "parameters": params}

response = sagemaker_runtime.invoke_endpoint(

Try it out and use the endpoint with the chunk_length_s parameter, this should work.
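Applied to speech recognition, the same idea might look like the sketch below. It only builds the JSON request body; the helper name `build_asr_request` and the "sample.wav" reference are hypothetical, and whether the endpoint forwards "parameters" to the ASR pipeline for a given input type is an assumption that should be checked against the toolkit.

```python
import json


def build_asr_request(audio_ref, chunk_length_s=10):
    """Build a JSON body with pipeline parameters next to the input.

    audio_ref is a hypothetical placeholder for however the audio is
    referenced (for example a URL or path string the pipeline can read).
    """
    payload = {
        "inputs": audio_ref,
        "parameters": {"chunk_length_s": chunk_length_s},
    }
    return json.dumps(payload)


body = build_asr_request("sample.wav", chunk_length_s=10)
print(body)
```

The resulting string would be sent as the Body argument of invoke_endpoint with ContentType set to "application/json".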

Hope that helps!



Hi @marshmellow77

Cool! I will try this, thanks. What about using a language model at inference? Is there another option?

You mean other than serverless? Yes, there are actually 4 different inference options on SageMaker. @philschmid just released a blog post comparing the different options:

No, I mean using a language model to boost Wav2Vec2 decoding, as described by @patrickvonplaten here: How to create Wav2Vec2 With Language model, but on Amazon SageMaker (serverless). In this topic @philschmid suggested using a custom inference script, but I’m having the problems mentioned above.

Is there another option to use a language model without a custom inference script?

What is the size of your custom model? And how are you creating the model.tar.gz? It might be possible that the archive size or model size caused the issue.

pytorch_model.bin size: 1.17 GB
model.tar.gz size: 1.08 GB
Number of parameters: 315438720

I’m creating the model.tar.gz using the following command:

tar zcvf model.tar.gz *
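A rough Python equivalent of that command, as a minimal sketch: it packs every entry of a model directory at the root of the archive, which is the layout the tar command above produces when run inside the model folder. The function name `make_model_archive` is hypothetical. Unlike the shell glob `*`, this version also includes hidden files.

```python
import tarfile
from pathlib import Path


def make_model_archive(model_dir, archive_path):
    """Pack every entry in model_dir at the root of a gzip-compressed tar."""
    model_dir = Path(model_dir)
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for entry in sorted(model_dir.iterdir()):
            # Skip the archive itself if it is written into model_dir.
            if entry.resolve() == archive_path.resolve():
                continue
            # arcname keeps entries at the archive root (no parent folder);
            # directories such as code/ are added recursively.
            tar.add(entry, arcname=entry.name)
    return archive_path
```

It could be called as make_model_archive("my_model", "model.tar.gz"), where "my_model" stands for the local folder holding the weights, config files and the code/ directory.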

@diegoseto I created a whole e2e example using jonatasgrosman/wav2vec2-large-xlsr-53-english and didn’t see any error, with the same model size and everything. (I removed the language model folder to get an average folder size.)


It works perfectly fine now when overriding only the “model_fn” function; I probably made a mistake when overriding the other functions :face_with_diagonal_mouth:. Thank you very much for your help and your time.

@philschmid another question, if you could help me. I’m running the model on my local machine with a language model set up. My directory structure:


Locally (loading the model using the pipeline object), the language model works fine at inference, but when deployed to SageMaker it apparently does not use the LM (I’m comparing the inference results). Everything is the same as locally: the pipeline, the model and the transformers version (4.17.0).

Did I forget something?

How did you set up your local env? Did you install additional dependencies? Did you install KenLM for your local env following these steps: Boosting Wav2Vec2 with n-grams in 🤗 Transformers?
I think KenLM is not yet available in the DLC.

hi @philschmid

Yes, I followed the steps in the article you mentioned. If KenLM is not available in the DLC, the other way is to override the predict_fn function in a custom inference script, right? If so, do you have any examples for Wav2Vec2 (like the other script you made overriding only model_fn)?


I think this wouldn’t solve the missing KenLM module. What you could do is use os.system('install kenlm') at the top of your script to install it on startup (it needs to finish in under 2 min; I am not sure what the behavior is for serverless).

I tried this but I’m still getting the same result.

import os
from transformers import pipeline

os.system('install kenlm')

def model_fn(model_dir):

    pipe = pipeline('automatic-speech-recognition', model_dir, chunk_length_s = 10)
    return pipe

@diegoseto with os.system('install kenlm') I meant adding the actual steps to install kenlm.
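One reason the placeholder command gave no visible error is that os.system only returns the exit status and never raises, so a failing command passes silently. A minimal sketch of a stricter runner follows; the helper name `run_step` is hypothetical, and the demonstration command is a harmless stand-in for a real installation step.

```python
import subprocess
import sys


def run_step(cmd):
    """Run one setup command and raise with its output if it fails."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd!r} failed with exit code {result.returncode}:\n"
            f"{result.stderr}"
        )
    return result.stdout


# Harmless demonstration; a real setup step would go here instead.
print(run_step([sys.executable, "-c", "print('step ok')"]))
```

With this pattern a failed install step stops the script at startup and the error text ends up in the endpoint logs.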

hi @philschmid

I tried to install kenlm following the steps of the article using os.system, and the commands seem to work fine, but I got this error at prediction time:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from model with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "module kenlm has no attribute Model"


import os
from transformers import pipeline

os.system('sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev')

os.system('wget -O - | tar xz')

os.system('mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2')

def model_fn(model_dir):

    pipe = pipeline('automatic-speech-recognition', model_dir, chunk_length_s = 10)
    return pipe
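One possible explanation for "module kenlm has no attribute Model" (an assumption, not confirmed in this thread) is that `import kenlm` picks up something other than the compiled Python bindings, for example a plain directory named kenlm left by the build steps, which Python can import as an empty namespace package. The sketch below is a small diagnostic to check that; the helper name `describe_module` is hypothetical.

```python
import importlib


def describe_module(name, required_attr):
    """Report where a module was loaded from and whether it has an attribute."""
    try:
        module = importlib.import_module(name)
    except ImportError as exc:
        return {"importable": False, "error": str(exc)}
    return {
        "importable": True,
        # None here usually means a namespace package, i.e. a plain directory
        # on sys.path rather than the compiled extension.
        "file": getattr(module, "__file__", None),
        "has_attr": hasattr(module, required_attr),
    }


print(describe_module("kenlm", "Model"))
```

Logging this result at the top of the inference script would show whether the bindings were actually installed in the container.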

I also tried to install the kenlm module via requirements.txt, but I got a different error:

UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-inference-2022-05-25-19-13-03-317: Failed. Reason: Received server error (0) from model with message "An error occurred while handling request as the model process exited.". See in account 094463604469 for more information..

Checking the logs, it looks like I’m getting a permission denied error on the src directory (created by the kenlm module setup):

OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
python: can't open file '/usr/local/bin/': [Errno 13] Permission denied
OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
Defaulting to user installation because normal site-packages is not writeable
Obtaining kenlm from git+ (from -r /opt/ml/model/code/requirements.txt (line 1))
ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/src'
Check the permissions.
WARNING: There was an error checking the latest version of pip.
2022-05-25 19:15:11,902 - sagemaker-inference - ERROR - failed to install required packages, exiting
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/", line 189, in _install_requirements
  File "/opt/conda/lib/python3.8/", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-m', 'pip', 'install', '-r', '/opt/ml/model/code/requirements.txt']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/bin/", line 23, in <module>
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/", line 34, in main
  File "/opt/conda/lib/python3.8/site-packages/", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.8/site-packages/", line 206, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.8/site-packages/", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.8/site-packages/", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.8/site-packages/", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/", line 30, in _start_mms
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/", line 91, in start_model_server
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/", line 192, in _install_requirements
    raise ValueError("failed to install required packages")
ValueError: failed to install required packages

My requirements.txt (I also tried to install via pip using os.system):

-e git+
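The "Permission denied: '/src'" line in the log is consistent with how pip handles editable (-e) installs from version control: it checks the source out into a src directory under the virtual environment or the current working directory. A possible workaround, untested here and offered only as an assumption, is to drop the -e flag so that pip builds from a temporary directory instead. The sketch below shows that rewrite; `strip_editable` is a hypothetical helper, and "git+<repository-url>" is a placeholder because the real URL is not shown in the thread.

```python
def strip_editable(requirement_lines):
    """Return requirement lines with a leading -e / --editable flag removed."""
    cleaned = []
    for line in requirement_lines:
        stripped = line.strip()
        for flag in ("--editable", "-e"):
            if stripped.startswith(flag + " ") or stripped.startswith(flag + "="):
                # Drop the flag and any separator that follows it.
                stripped = stripped[len(flag):].lstrip(" =")
                break
        cleaned.append(stripped)
    return cleaned


# Placeholder requirement; the real repository URL is not shown in the thread.
print(strip_editable(["-e git+<repository-url>"]))
```

Whether kenlm then compiles inside the container (it needs a C++ toolchain) is a separate question that this sketch does not address.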