SageMaker Serverless Inference

Hey @YannAgora,

Thanks for opening the thread. We encountered this error as well. See blog post

  • found limitation when testing: Currently, Transformer models > 512MB create errors

We already reported this error to the SageMaker team.

Which model are you trying to use, and with which memory configuration?

Hi @philschmid,

So far I have tried tuner007/pegasus_paraphrase and Vamsi/T5_Paraphrase_Paws, which are indeed both over 512MB, so that confirms the limitation you found. For MemorySize I set the value to 6144MB, which is the maximum available if I’m not mistaken.

I’ll let you know when this is fixed! I hope soon!

  1. Has the 512MB limitation been resolved? Does this issue exist only with Hugging Face models?
  2. Is there GPU support for Serverless Inference?

Any info will be greatly appreciated.

  1. Apparently, it does not currently support GPUs. I wonder what CPU instance type it uses and whether it is too slow for real-time inference on transformers.


  1. There is no fix yet for it but there is a workaround. You can set an environment variable MMS_DEFAULT_WORKERS_PER_MODEL=1 when creating the endpoint.
  2. Since Serverless Inference is powered by AWS Lambda, and AWS Lambda doesn’t have GPU support yet, Serverless Inference won’t have it either. I assume it will get GPU support when AWS Lambda does.
GPUs for inference are only relevant when there are parallelism opportunities, i.e. an inference request requires lots of computations. I often find CPUs sufficient for simpler workloads.
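
For anyone deploying with boto3 rather than the SageMaker Python SDK, the environment-variable workaround above could be sketched roughly as follows. The helper names, the variant name, the image URI and the S3 path are hypothetical placeholders, and the dictionaries are only meant to mirror the request shapes of `create_model` and `create_endpoint_config`; nothing here calls AWS.

```python
# Sketch only: builds request payloads for boto3's SageMaker client without calling AWS.
# Helper names, the variant name, the image URI and the S3 path are hypothetical.

# Memory sizes accepted by Serverless Inference (1 GB steps from 1 GB to 6 GB).
VALID_MEMORY_SIZES_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def build_container(image_uri, model_data_url):
    """Container definition intended for create_model(PrimaryContainer=...)."""
    return {
        "Image": image_uri,
        "ModelDataUrl": model_data_url,
        "Environment": {
            # Workaround from this thread: start a single model server worker.
            # Environment values must be strings.
            "MMS_DEFAULT_WORKERS_PER_MODEL": "1",
        },
    }


def build_serverless_variant(model_name, memory_size_in_mb=6144, max_concurrency=1):
    """Production variant intended for create_endpoint_config(ProductionVariants=[...])."""
    if memory_size_in_mb not in VALID_MEMORY_SIZES_MB:
        raise ValueError(f"memory_size_in_mb must be one of {VALID_MEMORY_SIZES_MB}")
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_size_in_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


# Example with placeholder values:
container = build_container("<image-uri>", "s3://<bucket>/model.tar.gz")
variant = build_serverless_variant("my-model", memory_size_in_mb=6144, max_concurrency=2)
```

The two dictionaries would then be passed to a `boto3.client("sagemaker")` instance together with a model name and an execution role.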

In the end, the response time will depend on many different factors, such as model size, payload, and additional pre/postprocessing. I have deployed a small BERT model for sentiment classification on a serverless endpoint and response times are ~200-300 ms (including passing the request through an API).
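
If you want to collect comparable numbers for your own endpoint, a minimal timing sketch could look like this. The `invoke` argument is a hypothetical callable that sends one request (for example through your API) and returns the response; `fake_invoke` is only a stand-in so the snippet runs without an endpoint.

```python
import statistics
import time


def measure_latency_ms(invoke, payload, runs=20, warmup=1):
    """Call invoke(payload) repeatedly and return latency stats in milliseconds."""
    for _ in range(warmup):
        invoke(payload)  # absorb a possible cold start so it does not skew the stats
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "min": samples[0],
        "median": statistics.median(samples),
        "max": samples[-1],
    }


def fake_invoke(payload):
    """Stand-in for a real endpoint call (hypothetical)."""
    time.sleep(0.01)
    return {"label": "POSITIVE"}


stats = measure_latency_ms(fake_invoke, {"inputs": "I love this!"}, runs=5)
print(stats)
```

Measuring a cold start separately (with `warmup=0` after the endpoint has been idle) is worth doing too, since serverless endpoints can be much slower on the first request.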


Hope that helps.


Thank you for the response, @philschmid. The trick will be helpful when deploying.
Thank you for the test, @marshmellow77. The model I am using is large, so I guess I will attempt to deploy an ONNX model to bring inference below 1 second. Is your classifier based on DistilBERT?

Yes, the classifier is based on DistilBERT.

I’m also running into this error when running serverless inference with SageMaker’s PyTorchModel class.

similarity_elec_model = PyTorchModel(

The serverless config is set up with the code below:

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144, max_concurrency=2,

Has there been a fix for this?

Hey @bennicholl,

this error should no longer exist. Could you please try using the HuggingFaceModel with transformers_version 4.17 and pytorch 1.6?

Hey @philschmid,

I just set up a serverless endpoint for conversational AI using facebook/blenderbot-400M-distill.

I still receive this error in CloudWatch, and the API call times out. It does eventually return a response, but it returns this error a few times before that happens.

I set up this endpoint using your CDK, sagemaker-serverless-huggingface-endpoint.

Do you know if there is something I can change in the setup which will prevent this from happening?



@dshandler-monash can you please describe your setup in a bit more detail? How much memory does your configuration have? What is the error you are seeing?

Thanks for replying.

I’m using AWS API Gateway to call SageMaker.

The endpoint has 6GB of memory, Max Concurrency = 8 and the model has MMS_DEFAULT_WORKERS_PER_MODEL=1

image: huggingface-pytorch-inference:1.9.1-transformers4.12.3-cpu-py38-ubuntu20.04

This is the error message from cloudwatch:
2022-10-17T06:02:34,139 [WARN ] W-9000-facebook__blenderbot-400M-stderr - python: can't open file '/usr/local/bin/': [Errno 13] Permission denied

Sometimes after 2-4 attempts the model works and returns a response. As long as you interact with it consistently, it continues to work, and the above error does not occur again.
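
Until the root cause is found, a client-side retry with exponential backoff can at least hide the failed first attempts from the caller. This is only a sketch: it does not fix the permission error itself, and `invoke` is a hypothetical callable that sends one request and raises on timeout or error.

```python
import time


def invoke_with_retry(invoke, payload, attempts=5, base_delay_s=1.0, sleep=time.sleep):
    """Call invoke(payload), retrying with exponential backoff on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return invoke(payload)
        except Exception as err:  # in real code, narrow this to timeout/5xx errors
            last_error = err
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
                sleep(base_delay_s * 2 ** attempt)
    raise last_error
```

The `sleep` parameter exists only so the delay can be swapped out in tests; in normal use the default `time.sleep` is fine.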

I’ve just seen that you mentioned transformers_version 4.17, while the CDK used 4.12.3. So I think it will get fixed by reinstalling it with LATEST_TRANSFORMERS_VERSION = "4.17.0" in the CDK config file.
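
To double-check which versions an endpoint is actually running, the image tag can be parsed. This is a small sketch that assumes the tag layout seen in the image name quoted above (`<framework version>-transformers<version>-...`); the function names are hypothetical.

```python
import re

# Matches ":<framework version>-transformers<transformers version>-" in an image name.
_TAG_PATTERN = re.compile(
    r":(?P<framework>\d+(?:\.\d+)*)-transformers(?P<transformers>\d+(?:\.\d+)*)-"
)


def parse_image_versions(image):
    """Extract the framework and transformers versions from a container image name."""
    match = _TAG_PATTERN.search(image)
    if match is None:
        raise ValueError(f"unrecognized image tag layout: {image}")
    return {
        "framework_version": match.group("framework"),
        "transformers_version": match.group("transformers"),
    }


def version_tuple(version):
    """Turn '4.12.3' into (4, 12, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


image = "huggingface-pytorch-inference:1.9.1-transformers4.12.3-cpu-py38-ubuntu20.04"
versions = parse_image_versions(image)
is_older = version_tuple(versions["transformers_version"]) < version_tuple("4.17.0")
print(versions, "older than 4.17.0:", is_older)
```

Comparing tuples rather than strings matters here, because as plain strings "4.4" would sort after "4.17".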

Yes should be the case.

I’ve recreated it with


and unfortunately still run into the same issue. Are there any other solutions you could think of?

Can you please provide the code you used to deploy, so we can try to reproduce your scenario?

I used this CDK with the changes to the versions in the config file. I also had to make some changes to IAM permissions during the installation to get it to complete.

This shouldn’t be the case. But can you try deploying your model using the sagemaker-sdk? See the blog post "Serverless Inference with Hugging Face's Transformers, DistilBERT and Amazon SageMaker".

I’m encountering this as well despite using the SDK as specified. I’ve tried it with both serverless and hosted inference and can’t get past it (the error being permission denied on the deep learning container, as noted above). I’m starting with a roberta-base model, so it isn’t that large in the grand scheme either. I’m using a custom script which comes from an estimator in the ML pipeline. It conforms to the docs in the inference-toolkit, but I’m wondering if that is part of this. As a note, this is all part of a SageMaker ML project (hence you’ll see training_step below, which just points to an S3 path where the model.tar.gz is).

Here is my Model code for reference:

env = {

model = HuggingFaceModel(
    name = model_step_name,
    transformers_version = "4.17",
    pytorch_version = "1.6",
    model_data =,
    role = role,
    env = env