SageMaker Serverless Inference

Hi there,

I have been trying to use the new serverless feature of SageMaker Inference, following the steps explained very well by @juliensimon in his video (using the same container image and the same ServerlessConfig), to deploy a Hugging Face model (not fine-tuned on my side). However, after successfully creating and deploying all resources (Model, EndpointConfig, Endpoint), I get this error when calling invokeEndpoint:

'Message': 'An exception occurred from internal dependency. Please contact customer support regarding request ...'

And when looking in CloudWatch, I also see this message:

python: can't open file "/usr/local/bin/": [Errno 13] Permission denied

I don’t get this error when calling invokeEndpoint on a non-serverless inference endpoint.

Has anyone else encountered this error?

Thanks in advance !!

Hey @YannAgora,

Thanks for opening the thread. We encountered this error as well. See blog post

  • found limitation when testing: Currently, Transformer models > 512MB create errors

We already reported this error to the SageMaker team.

Which model are you trying to use, and with which memory configuration?

Hi @philschmid,

So far I have tried tuner007/pegasus_paraphrase and Vamsi/T5_Paraphrase_Paws, which are indeed both over 512MB, so that confirms the limitation you found. For MemorySize I set the value to 6144MB, which is the maximum available if I’m not mistaken.

I’ll let you know when this is fixed! I hope soon!

  1. Has the 512MB limitation issue been resolved? Does this issue exist only with Hugging Face models?
  2. Is there GPU support for serverless inference?

Any info will be greatly appreciated.

  1. Apparently, it does not support GPUs currently. I wonder which CPU instance type it uses and whether it is too slow for real-time inference on transformers.


  1. There is no fix for it yet, but there is a workaround: you can set the environment variable MMS_DEFAULT_WORKERS_PER_MODEL=1 when creating the endpoint.
  2. Serverless Inference is powered by AWS Lambda, and AWS Lambda doesn’t have GPU support yet, so Serverless Inference doesn’t have it either. I assume it will get GPU support when AWS Lambda does.
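
For anyone wondering where that variable goes, here is a minimal sketch, assuming you create the resources with boto3 as in the original question. The environment variable is set on the Model's container. The function names, `AllTraffic` variant name, and the default values are only placeholders for illustration; `HF_MODEL_ID` and `HF_TASK` are the variables the Hugging Face inference container reads to load a Hub model.

```python
def build_model_request(model_name, image_uri, role_arn, hf_model_id, hf_task):
    """Arguments for the SageMaker CreateModel call, including the workaround."""
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": image_uri,
            "Environment": {
                "HF_MODEL_ID": hf_model_id,
                "HF_TASK": hf_task,
                # Workaround from this thread: run a single model server worker.
                "MMS_DEFAULT_WORKERS_PER_MODEL": "1",
            },
        },
    }


def build_endpoint_config_request(config_name, model_name,
                                  memory_size_in_mb=6144, max_concurrency=2):
    """Arguments for CreateEndpointConfig with a ServerlessConfig variant."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",  # placeholder name
                "ModelName": model_name,
                "ServerlessConfig": {
                    "MemorySizeInMB": memory_size_in_mb,
                    "MaxConcurrency": max_concurrency,
                },
            }
        ],
    }


def create_serverless_endpoint(endpoint_name, model_request, config_request):
    """Create Model, EndpointConfig and Endpoint (needs AWS credentials)."""
    import boto3  # imported lazily so the builders above work without AWS access

    sm = boto3.client("sagemaker")
    sm.create_model(**model_request)
    sm.create_endpoint_config(**config_request)
    sm.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_request["EndpointConfigName"],
    )
```

The two builder functions only assemble dictionaries, so you can inspect the requests before anything is created in your account.
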
GPUs for inference are only relevant when there are parallelism opportunities, i.e. an inference request requires lots of computations. I often find CPUs sufficient for simpler workloads.

In the end, the response time will depend on many factors, such as model size, payload, additional pre/postprocessing, etc. I have deployed a small BERT model for sentiment classification on a serverless endpoint, and response times are ~200-300 ms (including passing the request through an API).
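
If you want to measure this for your own model, a rough sketch along these lines should work. It calls the endpoint directly with boto3 rather than through an API, so the numbers are not directly comparable to the ones above. The helper names and the `{"inputs": ...}` payload shape are assumptions for illustration; adjust them to your model.

```python
import json
import time


def invoke(endpoint_name, payload):
    """Send one JSON request to a SageMaker endpoint (needs AWS credentials)."""
    import boto3  # imported lazily so time_calls can be used without AWS access

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())


def time_calls(fn, repeats=5):
    """Call fn() repeatedly and return each wall-clock latency in milliseconds."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies
```

For example, `time_calls(lambda: invoke("my-endpoint", {"inputs": "I love this"}))` with a hypothetical endpoint name. Expect the first call to be slower than the rest, since it can include a cold start.
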


Hope that helps.


Thank you for the response @philschmid. The trick will be helpful when deploying.
Thank you for the test @marshmellow77. The model I am using is large so I guess I will attempt to deploy an ONNX model to bring inference to below 1 second. Is your classifier based on DistilBERT?

Yes, the classifier is based on DistilBERT.

I’m also running into this error when running serverless inference with SageMaker’s PyTorchModel() class.

similarity_elec_model = PyTorchModel(

The serverless config is set up with the code below:

from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144, max_concurrency=2,
)

Has there been a fix for this?

Hey @bennicholl,

This error should no longer exist. Could you please try using the HuggingFaceModel with transformers_version 4.17 and pytorch_version 1.10?
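
A minimal sketch of what that could look like with the SageMaker Python SDK, assuming a model loaded from the Hub via `HF_MODEL_ID` and `HF_TASK`. The function names are placeholders. Note that `pytorch_version="1.10"` and `py_version="py38"` are what I believe the Hugging Face containers pair with transformers 4.17; please check against the container versions available to you.

```python
def serverless_settings():
    """Framework versions and serverless limits used in this sketch."""
    return {
        "framework": {
            "transformers_version": "4.17",
            "pytorch_version": "1.10",  # assumed pairing for transformers 4.17
            "py_version": "py38",       # assumed pairing for transformers 4.17
        },
        "serverless": {"memory_size_in_mb": 6144, "max_concurrency": 2},
    }


def deploy_huggingface_serverless(role_arn, hf_model_id, hf_task):
    """Deploy a Hub model to a serverless endpoint (needs AWS credentials)."""
    # Imported lazily so serverless_settings() works without the SDK installed.
    from sagemaker.huggingface import HuggingFaceModel
    from sagemaker.serverless import ServerlessInferenceConfig

    settings = serverless_settings()
    model = HuggingFaceModel(
        role=role_arn,
        env={"HF_MODEL_ID": hf_model_id, "HF_TASK": hf_task},
        **settings["framework"],
    )
    return model.deploy(
        serverless_inference_config=ServerlessInferenceConfig(**settings["serverless"]),
    )
```

The returned predictor can then be called with `predictor.predict({"inputs": "..."})`.
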