Hey @philschmid, thanks for your response and for looking into the model performance. I am using the distilbert-base-uncased model, fine-tuned on a custom dataset for multiclass categorisation. The task is text-classification and the instance used for deployment is ml.g4dn.xlarge.
This morning I tested deploying with the SageMaker SDK instead of Terraform, and found the same issue. For the first deployment I did the following:
from sagemaker.huggingface import HuggingFaceModel

ENDPOINT_NAME = "bert-latency-test"
huggingface_model = HuggingFaceModel(
    role="-",
    model_data="-",
    entry_point="-",
    source_dir="-",
    code_location="-",
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
)
pred = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    endpoint_name=ENDPOINT_NAME,
)
and made requests with:
request = {"inputs": [-]}
pred.predict(request)
The "inputs" field in each request is a list of 25 short strings. The container image associated with the created SageMaker model is:
-/huggingface-pytorch-inference:1.7-transformers4.6-gpu-py36-cu110-ubuntu18.04
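For clarity, the request payload has roughly this shape (dummy strings here, since the real inputs are redacted):

# Illustrative payload only - the real 25 strings are redacted above
request = {
    "inputs": [
        "short example text 1",
        "short example text 2",
        # ... 25 short strings in total
    ]
}
pred.predict(request)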
Attaching a similar screenshot from CloudWatch showing model and overhead latency.
For the second test everything stays constant except the transformers, Python, and PyTorch versions:
ENDPOINT_NAME = "bert-latency-test"
huggingface_model = HuggingFaceModel(
    role="-",
    model_data="-",
    entry_point="-",
    source_dir="-",
    code_location="-",
    transformers_version='4.12.3',
    pytorch_version='1.9.1',
    py_version='py38',
)
pred = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    endpoint_name=ENDPOINT_NAME,
)
which creates a model using the following image:
-/huggingface-pytorch-inference:1.9.1-transformers4.12.3-gpu-py38-cu111-ubuntu20.04
Requests are sent in the same way, and it's here that I observe the jump in model latency, as shown in the following CloudWatch screenshot.
SEE BELOW REPLY FOR HIGH MODEL LATENCY SCREENSHOT
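In case it helps with reproducing this, a minimal client-side timing loop along these lines can be used as a rough end-to-end sanity check alongside the CloudWatch metrics (just a sketch; it reuses the pred and request objects from above):

import time

# Rough end-to-end latency check from the client side (sketch only).
# Reuses `pred` and `request` from the snippets above.
latencies_ms = []
for _ in range(50):
    start = time.perf_counter()
    pred.predict(request)
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
print(f"p50: {latencies_ms[len(latencies_ms) // 2]:.1f} ms, max: {latencies_ms[-1]:.1f} ms")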
I'd be very appreciative of any advice you can give here.
On top of this (and this should possibly go in a separate thread), the way our application will make requests to the endpoint is through a sagemaker-runtime client. I've noticed there is quite a jump in OverheadLatency when doing this instead of using the SageMaker Predictor object, e.g.
import boto3
from sagemaker.serializers import JSONSerializer

client = boto3.client("sagemaker-runtime")
client.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    Body=JSONSerializer().serialize(request),
    ContentType="application/json",
)
SEE BELOW REPLY FOR HIGH OVERHEAD LATENCY SCREENSHOT
Is there anything I can do to minimise the overhead latency here?
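For reference, this is roughly the calling pattern I could tune on our side; it reuses a single client and pre-serializes the payload, though I'm not sure how much of the OverheadLatency is actually client-side (the Config values are just placeholders):

import json

import boto3
from botocore.config import Config

# Reuse one client (connection pooling) instead of creating one per request,
# and serialize the payload once with json.dumps.
# The Config values are placeholders, not tuned recommendations.
client = boto3.client(
    "sagemaker-runtime",
    config=Config(retries={"max_attempts": 2}, max_pool_connections=10),
)

body = json.dumps(request)  # same {"inputs": [...]} payload as above
response = client.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    Body=body,
    ContentType="application/json",
)
result = json.loads(response["Body"].read())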
I noticed that new users can only embed one media item per post, so I'll put the other screenshots in separate replies.
Many thanks for your help,
Owen