SageMaker Serverless Inference

Hi there,

I have been trying out the new serverless feature of SageMaker Inference, following the steps explained by @juliensimon in his video (using the same container image and the same ServerlessConfig) to deploy a Hugging Face model (not fine-tuned on my side). However, after successfully creating and deploying all resources (Model, EndpointConfig, Endpoint), calling InvokeEndpoint returns this error:

'Message': 'An exception occurred from internal dependency. Please contact customer support regarding request ...'

Looking at CloudWatch, I also see this message:

python: can't open file "/usr/local/bin/": [Errno 13] Permission denied

I don’t get this error when calling InvokeEndpoint on a non-serverless inference endpoint.

Has anyone else encountered this error?

Thanks in advance!
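For reference, the setup I'm describing looks roughly like this, as a minimal sketch of the boto3 request body (all resource names here are placeholders; the ServerlessConfig keys are the documented ones):

```python
# Sketch of a serverless endpoint config as passed to boto3's
# create_endpoint_config. All names are placeholders.
endpoint_config = {
    "EndpointConfigName": "hf-serverless-config",   # placeholder
    "ProductionVariants": [
        {
            "VariantName": "AllTraffic",
            "ModelName": "hf-paraphrase-model",     # placeholder
            "ServerlessConfig": {
                "MemorySizeInMB": 6144,  # max value, as used in this thread
                "MaxConcurrency": 2,
            },
        }
    ],
}

# With an AWS session this would be passed to:
#   sm = boto3.client("sagemaker")
#   sm.create_endpoint_config(**endpoint_config)
#   sm.create_endpoint(EndpointName="hf-serverless-endpoint",
#                      EndpointConfigName=endpoint_config["EndpointConfigName"])
```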


Hey @YannAgora,

Thanks for opening the thread. We encountered this error as well; see the blog post:

  • Limitation found when testing: currently, Transformer models larger than 512 MB cause errors

We already reported this error to the SageMaker team.

Which model are you trying to use, and with which memory configuration?

Hi @philschmid,

So far I have tried tuner007/pegasus_paraphrase and Vamsi/T5_Paraphrase_Paws, which are indeed both over 512 MB, confirming the limitation you found. For MemorySize I set 6144 MB, which is the maximum value available if I’m not mistaken.

I’ll let you know when this is fixed! I hope soon!



  1. Has the 512 MB limitation issue been resolved? Does this issue exist only with Hugging Face models?
  2. Is there GPU support for Serverless Inference?

Any info will be greatly appreciated.

  1. Apparently it does not support GPUs currently. I wonder what CPU instance type it uses and whether it is too slow for real-time inference on Transformers.


  1. There is no fix yet, but there is a workaround: you can set the environment variable MMS_DEFAULT_WORKERS_PER_MODEL=1 when creating the endpoint.
  2. Since Serverless Inference is powered by AWS Lambda, and AWS Lambda doesn’t have GPU support yet, Serverless Inference won’t have it either. I assume it will get GPU support once AWS Lambda does.
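A sketch of that workaround: the setting is passed as a container environment variable when creating the model. The HF_TASK value below is an assumption for this particular model, and the surrounding sagemaker SDK call is shown only in comments:

```python
# Workaround sketch: pass MMS_DEFAULT_WORKERS_PER_MODEL=1 in the
# model's container environment.
hub_env = {
    "HF_MODEL_ID": "tuner007/pegasus_paraphrase",  # model from this thread
    "HF_TASK": "text2text-generation",             # assumed task for this model
    "MMS_DEFAULT_WORKERS_PER_MODEL": "1",          # the workaround
}

# With the sagemaker Python SDK this would be passed as:
#   HuggingFaceModel(env=hub_env, role=role, ...)
```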

GPUs for inference are only relevant when there are parallelism opportunities, i.e. when an inference request requires a lot of computation. I often find CPUs sufficient for simpler workloads.

In the end the response time will depend on many different factors, such as model size, payload, additional pre/postprocessing, etc. I have deployed a small BERT model for sentiment classification on a serverless endpoint, and response times are ~200-300 ms (including passing the request through an API).
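If you want to reproduce such a measurement, a small timing helper like the following works (a sketch; `invoke` stands in for whatever actually calls the endpoint, e.g. the sagemaker-runtime `invoke_endpoint` API):

```python
import time

def measure_latency_ms(invoke, payload, n=10):
    """Call invoke(payload) n times and return per-call latencies in ms."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        invoke(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

# Example with the real endpoint call (runtime = boto3.client("sagemaker-runtime")):
# latencies = measure_latency_ms(
#     lambda p: runtime.invoke_endpoint(EndpointName="my-endpoint",  # placeholder
#                                       ContentType="application/json",
#                                       Body=p),
#     payload)
```

Note that the first call after a cold start will be much slower than the steady-state numbers, so it is worth discarding or reporting it separately.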


Hope that helps.



Thank you for the response, @philschmid. The trick will be helpful when deploying.
Thank you for the test, @marshmellow77. The model I am using is large, so I guess I will attempt to deploy an ONNX model to bring inference below 1 second. Is your classifier based on DistilBERT?


Yes, the classifier is based on DistilBERT.

I’m also running into this error when running a serverless inference endpoint created with SageMaker’s PyTorchModel() class.

similarity_elec_model = PyTorchModel(

The serverless config is set up with the code below:

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144, max_concurrency=2,
)

Has there been a fix for this?

Hey @bennicholl,

This error should no longer exist. Could you please try using the HuggingFaceModel with transformers_version 4.17 and PyTorch 1.6?