Sagemaker Serverless Inference

YannAgora · December 30, 2021, 1:44pm

Hi there,

I have been trying to use the new serverless feature of SageMaker Inference, following the steps very well explained by @juliensimon in his video (using the same image for the container and the same ServerlessConfig) with a Hugging Face model (not fine-tuned on my side). However, after successfully creating and deploying all resources (Model, EndpointConfig, Endpoint) and calling invokeEndpoint, I encountered this error:

'Message': 'An exception occurred from internal dependency. Please contact customer support regarding request ...'

When looking at CloudWatch, I also get this message:

python: can't open file "/usr/local/bin/deep_learning_container.py": [Errno 13] Permission denied

I don’t get this error when using invokeEndpoint on a non-serverless inference endpoint.

Has anyone already encountered this error?

Thanks in advance !!

philschmid · December 30, 2021, 1:46pm

Hey @YannAgora,

Thanks for opening the thread. We encountered this error as well. From the blog post:

found limitation when testing: Currently, Transformer models > 512MB create errors

We already reported this error to the SageMaker team.

Which model are you trying to use, and with which memory configuration?

YannAgora · December 30, 2021, 1:52pm

Hi @philschmid,

So far I have tried tuner007/pegasus_paraphrase and Vamsi/T5_Paraphrase_Paws, which are indeed both over 512MB, so that confirms the limitation you found. For MemorySize I set the value to 6144MB, which is the maximum available if I’m not mistaken.
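
For readers following along, the ServerlessConfig described above can be sketched as a request payload for boto3's create_endpoint_config call. This is a minimal sketch, not the code used in this thread: the helper name, the variant name and the resource names are hypothetical, and the list of accepted memory sizes (1 GB steps up to 6144MB) reflects the service limits as I understand them.

```python
# Memory sizes (in MB) assumed to be accepted by SageMaker Serverless Inference.
VALID_MEMORY_SIZES_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def build_endpoint_config(config_name, model_name, memory_mb=6144, max_concurrency=1):
    """Return keyword arguments for the SageMaker create_endpoint_config call."""
    if memory_mb not in VALID_MEMORY_SIZES_MB:
        raise ValueError(
            f"memory_mb must be one of {VALID_MEMORY_SIZES_MB}, got {memory_mb}"
        )
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",  # hypothetical variant name
                "ModelName": model_name,
                # A serverless variant carries a ServerlessConfig instead of
                # an instance type and instance count.
                "ServerlessConfig": {
                    "MemorySizeInMB": memory_mb,
                    "MaxConcurrency": max_concurrency,
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical resource names, for illustration only.
    kwargs = build_endpoint_config("paraphrase-serverless-config", "paraphrase-model")
    print(kwargs)
    # With AWS credentials configured, the request could then be sent with:
    #   import boto3
    #   boto3.client("sagemaker").create_endpoint_config(**kwargs)
```

Keeping the payload in a plain function makes it easy to check the memory setting before any AWS call is made.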

philschmid · December 30, 2021, 2:09pm

I’ll let you know when this is fixed! I hope soon!

nurik · January 24, 2022, 7:18pm

hi!

has the 512MB limitation issue been resolved? does this issue exist only with huggingface models?
is there a gpu support for the serverless inference?

any info will be greatly appreciated.

nurik · January 24, 2022, 8:41pm

apparently, it does not support gpus currently. i wonder what cpu instance type it uses and whether it is too slow for real-time inference on transformers.

philschmid · January 25, 2022, 12:49pm

Hello,

There is no fix yet for it but there is a workaround. You can set an environment variable MMS_DEFAULT_WORKERS_PER_MODEL=1 when creating the endpoint.
Serverless Inference is powered by AWS Lambda, and AWS Lambda doesn’t have GPU support yet, so Serverless Inference doesn’t have it either. I assume it will get GPU support when AWS Lambda does.
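
A minimal sketch of how the workaround above might be applied when the model is loaded from the Hub. The helper name is hypothetical, and the task passed for the example model is an assumption; HF_MODEL_ID and HF_TASK are the variables the Hugging Face inference container reads, and container environment values have to be strings.

```python
def build_hub_env(model_id, task, workers_per_model=1):
    """Build the container environment for a Hugging Face inference container."""
    return {
        "HF_MODEL_ID": model_id,
        "HF_TASK": task,
        # Workaround from this thread: one model server worker per model.
        "MMS_DEFAULT_WORKERS_PER_MODEL": str(workers_per_model),
    }


if __name__ == "__main__":
    # The task name is an assumption for this model.
    env = build_hub_env("tuner007/pegasus_paraphrase", "text2text-generation")
    print(env)
    # The dictionary could then be passed when creating the model, e.g.
    #   from sagemaker.huggingface import HuggingFaceModel
    #   model = HuggingFaceModel(env=env, role=role, ...)
    # where `role` and the version arguments are defined elsewhere.
```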

marshmellow77 · January 25, 2022, 4:33pm

GPUs for inference are only relevant when there are parallelism opportunities, i.e. an inference request requires lots of computations. I often find CPUs sufficient for simpler workloads.

In the end the response time will depend on many different factors, such as model size, payload, additional pre/postprocessing, etc. I have deployed a small BERT model for sentiment classification on a serverless endpoint and response times are ~200-300 ms (including passing the request through an API).

Hope that helps.

Cheers
Heiko
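
For anyone who wants to take a similar measurement, here is a rough sketch of a client-side timing helper. It is not the setup used for the numbers above: the function names are hypothetical, it measures the full round trip as seen by the caller, and the first request may include a cold start.

```python
import statistics
import time


def measure_latency_ms(call, payload, repeats=10):
    """Call `call(payload)` `repeats` times; return per-request latencies in ms."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        call(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies):
    """Reduce a list of latencies to min, median and max."""
    return {
        "min_ms": min(latencies),
        "median_ms": statistics.median(latencies),
        "max_ms": max(latencies),
    }


def make_endpoint_caller(endpoint_name):
    """Return a function that sends a JSON payload to a SageMaker endpoint."""
    import json

    import boto3  # imported lazily so the timing helpers work without AWS access

    runtime = boto3.client("sagemaker-runtime")

    def call(payload):
        response = runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        return json.loads(response["Body"].read())

    return call
```

With credentials configured, something like summarize(measure_latency_ms(make_endpoint_caller("my-endpoint"), {"inputs": "I love this"})) would print the spread, where the endpoint name and payload are placeholders.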

nurik · January 25, 2022, 4:50pm

Thank you for the response, @philschmid. The trick will be helpful when deploying.
Thank you for the test @marshmellow77. The model I am using is large so I guess I will attempt to deploy an ONNX model to bring inference to below 1 second. Is your classifier based on DistilBERT?

marshmellow77 · January 25, 2022, 5:47pm

Yes, the classifier is based on DistilBert.

bennicholl · June 14, 2022, 5:39pm

I’m also running into this error when running serverless inference with SageMaker’s PyTorchModel() class.

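from sagemaker.pytorch import PyTorchModel

# `role` (the SageMaker execution role) is assumed to be defined elsewhere
similarity_elec_model = PyTorchModel(
    model_data='s3://pathname/model.tar.gz',
    role=role,
    entry_point='torchserve.py',
    source_dir='source_dir',
    framework_version='1.6.0',
    py_version='py36',
)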

The serverless config is set up with the code below:

from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,
    max_concurrency=2,
)

Has there been a fix for this?

philschmid · June 15, 2022, 8:52am

Hey @bennicholl,

this error should no longer exist. Could you please try using the HuggingFaceModel with transformers_version 4.17 and pytorch 1.6?
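
A rough sketch of what that suggestion could look like with the SageMaker Python SDK. The helper names are hypothetical and this is not a verified fix. Note one assumption in particular: the transformers 4.17.0 image quoted later in this thread pairs with PyTorch 1.10.2, so the sketch passes pytorch_version "1.10" rather than 1.6.

```python
def build_huggingface_model_args(role, model_data):
    """Keyword arguments for sagemaker.huggingface.HuggingFaceModel."""
    return {
        "model_data": model_data,
        "role": role,
        "transformers_version": "4.17",
        # Assumption: PyTorch 1.10 is the version paired with transformers 4.17.
        "pytorch_version": "1.10",
        "py_version": "py38",
    }


def deploy_serverless(model_args, memory_size_in_mb=6144, max_concurrency=2):
    """Create the model and deploy it to a serverless endpoint."""
    # Imported lazily so the argument helper can be used without the SDK installed.
    from sagemaker.huggingface import HuggingFaceModel
    from sagemaker.serverless import ServerlessInferenceConfig

    model = HuggingFaceModel(**model_args)
    return model.deploy(
        serverless_inference_config=ServerlessInferenceConfig(
            memory_size_in_mb=memory_size_in_mb,
            max_concurrency=max_concurrency,
        )
    )


if __name__ == "__main__":
    # Placeholder role; the S3 path is the one from the post above.
    args = build_huggingface_model_args(
        role="my-sagemaker-execution-role",
        model_data="s3://pathname/model.tar.gz",
    )
    print(args)
    # predictor = deploy_serverless(args)  # requires AWS credentials
```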

dshandler-monash · October 15, 2022, 3:15am

Hey @philschmid,

I just set up a serverless endpoint for conversational AI using facebook/blenderbot-400M-distill.

I still receive this error on cloudwatch and the API call times out. It does eventually return a response but will return this error a few times before that happens.

I set up this endpoint using your cdk, sagemaker-serverless-huggingface-endpoint.

Do you know if there is something I can change in the setup which will prevent this from happening?

Thanks,

Darren

philschmid · October 17, 2022, 5:41am

@dshandler-monash can you please describe your setup in a bit more detail? How much memory does your configuration have? What is the error you are seeing?

dshandler-monash · October 17, 2022, 6:51am

Thanks for replying.

I’m using AWS API Gateway to call sagemaker.

The endpoint has 6GB of memory, Max Concurrency = 8 and the model has MMS_DEFAULT_WORKERS_PER_MODEL=1

image: huggingface-pytorch-inference:1.9.1-transformers4.12.3-cpu-py38-ubuntu20.04

This is the error message from cloudwatch:
2022-10-17T06:02:34,139 [WARN ] W-9000-facebook__blenderbot-400M-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - python: can't open file '/usr/local/bin/deep_learning_container.py': [Errno 13] Permission denied

Sometimes after 2-4 attempts the model works and returns a response. As long as you interact with it consistently, it continues to work, and the above error does not occur again.

Edit:
I’ve just seen you mentioned transformers_version 4.17 and the CDK used 4.12.3. So I think it will get fixed by reinstalling it using LATEST_TRANSFORMERS_VERSION = "4.17.0" in config.py?
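
Independent of the version question, the behaviour described in this post (the first few calls fail, later ones succeed) can be softened on the client side with retries. The following is only a sketch of a generic retry with exponential backoff. It does not address the underlying error, the function name and defaults are hypothetical, and a caller sitting behind API Gateway is still bound by that service's own timeout.

```python
import time


def call_with_retries(call, payload, max_attempts=5, base_delay_s=1.0, sleep=time.sleep):
    """Call `call(payload)`, retrying with exponential backoff on failure."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call(payload)
        except Exception as error:  # in real code, catch the specific client error
            last_error = error
            if attempt < max_attempts:
                # Wait 1x, 2x, 4x, ... the base delay between attempts.
                sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

Here `call` is any function that sends one request to the endpoint, for example a thin wrapper around the sagemaker-runtime invoke_endpoint call.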

philschmid · October 17, 2022, 8:10am

I’ve just seen you mentioned transformers_version 4.17 and the CDK used 4.12.3. So I think it will get fixed by reinstalling it using LATEST_TRANSFORMERS_VERSION = "4.17.0" in config.py?

Yes, that should be the case.

dshandler-monash · October 17, 2022, 11:18pm

I’ve recreated it with

huggingface-pytorch-inference:1.10.2-transformers4.17.0-cpu-py38-ubuntu20.04

and unfortunately still run into the same issue. Are there any other solutions you could think of?

philschmid · October 18, 2022, 5:43am

Can you please provide the code you used to deploy, so we can try to reproduce your scenario?

dshandler-monash · October 25, 2022, 6:51am

I used this CDK with the changes to the versions in the config file. I also had to make some changes to IAM permissions during the installation to get it to complete.

philschmid · October 31, 2022, 8:37am

This shouldn’t be the case. But can you try deploying your model using the sagemaker-sdk? See: Serverless Inference with Hugging Face's Transformers, DistilBERT and Amazon SageMaker
