I have been trying to use the new serverless feature from Sagemaker Inference, following the different steps very well explained by @juliensimon in his video (using same Image for the container and same ServerlessConfig) to use an HuggingFace model (not fine-tuned on my side). However after having successfully created/deployed all resources (Model, EndpointConfig, Endpoint) and trying to invokeEndpoint, I encountered this error :
'Message': 'An exception occurred from internal dependency. Please contact customer support regarding request ...'
And when looking on cloudwatch, I also get this message :
python: can't open file "/usr/local/bin/deep_learning_container.py": [Errno 13] Permission denied
Error that I don’t get when using invokeEndpoint for a non-serverless Inference Endpoint.
For now I tried on tuner007/pegasus_paraphrase and Vamsi/T5_Paraphrase_Paws that are indeed two models over 512MB which confirms the limitation you found. And for the MemorySize I set the value to 6144MB which is the max value available if I’m not mistaken.
apparently, it does not support gpus currently. i wonder what cpu instance type it uses and whether it is too slow for real-time inference on transformers.
There is no fix yet for it but there is a workaround. You can set an environment variable MMS_DEFAULT_WORKERS_PER_MODEL=1 when creating the endpoint.
Since Serverless Inference is powered by AWS Lambda and AWS Lambda doesn’t have GPU support yet Serverless Inference won’t have it as well. And i assume it will get GPU support when AWS Lambda has GPU support.
GPUs for inference are only relevant when there are parallelism opportunities, i.e. an inference request requires lots of computations. I often find CPUs sufficient for simpler workloads.
In the end the respoonse time will depend on many different factors, such as model size, payload, additional pre/postprocessing, etc. I have deployed a small BERT model for sentiment classification on a serverless endpoint and response times are ~200-300 ms (including passing the request through an API):
Thank you for response @philschmid .The trick will be helpful when deploying.
Thank you for the test @marshmellow77. The model I am using is large so I guess I will attempt to deploy an ONNX model to bring inference to below 1 second. Is your classifier based on DistilBERT?
I just setup a serverless endpoint for conversational AI using facebook/blenderbot-400M-distill.
I still receive this error on cloudwatch and the API call times out. It does eventually return a response but will return this error a few times before that happens.
I set up this endpoint using your cdk, sagemaker-serverless-huggingface-endpoint.
Do you know if there is something I can change in the setup which will prevent this from happening?
This is the error message from cloudwatch:
2022-10-17T06:02:34,139 [WARN ] W-9000-facebook__blenderbot-400M-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - python: can’t open file ‘/usr/local/bin/deep_learning_container.py’: [Errno 13] Permission denied
Sometimes after 2-4 attempts the model works and returns a response. As long as you interact with it consistently, it continues to work, and the above error does not occur again.
Edit:
I’ve just seen you mentioned transformers_version 4.17 and the CDK used 4.12.3. So I think it will get fixed by reinstalling it using LATEST_TRANSFORMERS_VERSION = “4.17.0” in config.py?
I’ve just seen you mentioned transformers_version 4.17 and the CDK used 4.12.3. So I think it will get fixed by reinstalling it using LATEST_TRANSFORMERS_VERSION = “4.17.0” in config.py?
I used this CDK with the changes to the versions in the config file. I also had to make some changes to IAM permissions during the installation to get it to complete.