Slow inference using most recent docker image

We have been deploying a BERT model to a SageMaker endpoint on a g4dn.xlarge instance. The deployment is managed with Terraform and we use the following image:

{account}.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-inference:1.7-transformers4.6-gpu-py36-cu110-ubuntu18.04

We’re really happy with the model latency, which is about 0.05 seconds for the typical request.

Now we were experimenting with using the more recent image:

{account}.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-inference:1.9.1-transformers4.12.3-gpu-py38-cu111-ubuntu20.04 

and found that model latency increased by a factor of about 10, everything else being equal. I noticed in the SageMaker SDK docs that the supported versions of PyTorch, transformers and Python are 1.7.1, 4.6.1 and 3.6 respectively.
So my question is more of a curiosity - why does the model latency increase so much on this most recent image? Is there some issue using the GPU here?
Thanks,
Owen

Hey @ojturner,

Thank you for opening the thread. A performance drop like that should not happen! Can you share which model (at least the architecture and task) you are using, and which instance type?

@ojturner I tested it myself using the distilbert-base-uncased-finetuned-sst-2-english model, once with transformers=4.6.1 and once with transformers=4.12.3.
For me, 4.12.3 achieves better results than 4.6.1:

4.6.1 (latency screenshot)

4.12.3 (latency screenshot)

Hey @philschmid thanks for your response and for looking into the model performance. I am using the distilbert-base-uncased model, finetuned using a custom dataset for multiclass categorisation. The task is text-classification and the instance used for deployment is ml.g4dn.xlarge.

I tested this morning deploying with the SageMaker SDK instead of Terraform, and found the same issue. For the first deployment I did the following:

from sagemaker.huggingface import HuggingFaceModel

ENDPOINT_NAME = "bert-latency-test"
huggingface_model = HuggingFaceModel(
    role="-",
    model_data="-",
    entry_point="-",
    source_dir="-",
    code_location="-",
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
)
pred = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    endpoint_name=ENDPOINT_NAME,
)

and made requests with

request = {"inputs": [-]}
pred.predict(request)
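
Just to make the shape of the payload concrete (the real strings are omitted above), a hypothetical request looks like:

# hypothetical payload for illustration only; the real texts are our own customer strings
request = {"inputs": [f"example short text {i}" for i in range(25)]}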

The “inputs” in each request is a list of 25 short strings. The image used for the container associated with the created SageMaker model is:

-/huggingface-pytorch-inference:1.7-transformers4.6-gpu-py36-cu110-ubuntu18.04

Attaching a similar screenshot from CloudWatch showing model and overhead latency.

For the second test everything stays constant except the transformers, python and pytorch version:

ENDPOINT_NAME = "bert-latency-test"
huggingface_model = HuggingFaceModel(
    role="-",
    model_data="-",
    entry_point="-",
    source_dir="-",
    code_location="-",
    transformers_version='4.12.3',
    pytorch_version='1.9.1',
    py_version='py38',
)
pred = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    endpoint_name=ENDPOINT_NAME,
)

which creates a model using the following image:

-/huggingface-pytorch-inference:1.9.1-transformers4.12.3-gpu-py38-cu111-ubuntu20.04

Requests are sent in the same way, and it's here that I observe the jump in model latency, as per the following screenshot from CloudWatch.
SEE BELOW REPLY FOR HIGH MODEL LATENCY SCREENSHOT

Would be very appreciative of any advice you can give here.

On top of this (and this should possibly go in a separate thread): the way our application will make requests to the endpoint is through a sagemaker-runtime client. I've noticed there is quite a jump in OverheadLatency when doing this instead of using the sagemaker Predictor object, e.g.

import boto3
from sagemaker.serializers import JSONSerializer
client = boto3.client("sagemaker-runtime")
response = client.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    Body=JSONSerializer().serialize(request),
    ContentType="application/json",
)
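
For reference, and assuming the default JSON response format, the body can then be read back with something like:

import json
# response["Body"] is a streaming body; decode it and parse the JSON payload
result = json.loads(response["Body"].read().decode("utf-8"))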

SEE BELOW REPLY FOR HIGH OVERHEAD LATENCY SCREENSHOT

Is there anything I can do to minimise the overhead latency here?

I noticed that new users can only embed one media item per post, so I'll put the other screenshots in separate replies.

Many thanks for your help,
Owen

HIGH MODEL LATENCY SCREENSHOT

HIGH OVERHEAD LATENCY SCREENSHOT

Hello @ojturner,

From your code

it looks like you are using a custom inference.py script, is that correct? Could you share it? Have you tested the latency and overhead using the "zero-code" deployment, i.e. without providing an inference.py?

Could you also share more information about which model/model-architecture/task you are using?

Hey @philschmid - yes that’s right, we’ve been using a custom inference.py script. The only thing we have in there is a custom model_fn to return all scores, instead of just the top category for our categorisation task:

import os
from sagemaker_huggingface_inference_toolkit import transformers_utils

GPU_ID = 0
GPU_NOT_AVAILABLE_ID = -1

def model_fn(model_dir):
    """
    The Load handler is responsible for loading the Hugging Face transformer model.
    It can be overridden to load the model from storage
    Returns:
        hf_pipeline (Pipeline): A Hugging Face Transformer pipeline.
    """
    config_file = "config.json"
    if config_file not in os.listdir(model_dir):
        raise ValueError(f"{config_file} not found", 403)

    # infer the pipeline task (here: text-classification) from the model config
    task = transformers_utils.infer_task_from_model_architecture(f"{model_dir}/{config_file}")
    # run on GPU 0 if available, otherwise fall back to CPU (-1)
    device_id = GPU_ID if transformers_utils._is_gpu_available() else GPU_NOT_AVAILABLE_ID

    hf_pipeline = transformers_utils.get_pipeline(
        task=task,
        model_dir=model_dir,
        device=device_id,
        return_all_scores=True,  # return the score for every label, not just the top one
    )

    return hf_pipeline

But we’ve also tried the “zero-code” deployment and inference is just as slow using transformers 4.12.3.
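
For reference, by "zero-code" I mean deploying the same model_data without an entry_point/source_dir, so the container's default handler is used, roughly:

huggingface_model = HuggingFaceModel(
    role="-",
    model_data="-",  # same model.tar.gz as above
    transformers_version='4.12.3',
    pytorch_version='1.9.1',
    py_version='py38',
)
pred = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)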

The task is sequence classification, in this case there are 60 different possible targets. The model is loaded as:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=60)

and then we fine tune on our labelled dataset. Let me know if you require any more information, thank you. Owen

Hey @philschmid happy new year!

Was wondering if you had any more thoughts on this issue?

We found that when we compiled the model for use with one of the Inferentia endpoints and wrote our own predict_fn, we didn't have any latency issues with the latest code version. Presumably, then, the original latency problem stems from making predictions through the Hugging Face pipeline object in the latest version?
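
For anyone following along, our predict_fn is roughly of the following shape. This is a simplified sketch rather than our exact code: the tokenizer name, sequence length and output handling are placeholder assumptions, and a Neuron-compiled model expects the fixed input shapes it was traced with.

import torch
from transformers import AutoTokenizer

MAX_LENGTH = 128  # placeholder; must match the sequence length the model was compiled with
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def predict_fn(data, model):
    texts = data["inputs"]
    # pad/truncate to the fixed shape the compiled model expects
    encoded = tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors="pt",
    )
    with torch.no_grad():
        # the traced model takes positional inputs and returns a tuple; logits come first
        logits = model(encoded["input_ids"], encoded["attention_mask"])[0]
    scores = torch.nn.functional.softmax(logits, dim=-1)
    return [{"scores": row.tolist()} for row in scores]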

Thank you, Owen

Happy new year as well :confetti_ball:

That could well be the case. You could confirm it by adding a requirements.txt next to your inference.py script and pinning a newer transformers version in it.
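
For example, something like this next to your inference.py (the exact pin is just an example) would install the newer transformers into the existing container at start-up:

# source_dir/requirements.txt (placed next to inference.py)
transformers==4.12.3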

Since you are using Inferentia already, I am happy to share that there will be soon HF-specific Inferentia DLCs.

Hey @philschmid, just for completeness, we figured out the issue here.
We pinned the latency increase down to the jump from version 4.10 → 4.11, when a lot of pipeline refactoring was done. As part of this, batch_size was introduced as an argument when creating a pipeline, with a default value of 1, which means each of the 25 strings in a request is run through the model as a separate forward pass. Changing this value to something bigger solved our latency issue.
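
Concretely, the change amounts to something like this in our model_fn (assuming get_pipeline forwards extra keyword arguments to the underlying transformers pipeline, as it already does for return_all_scores):

    hf_pipeline = transformers_utils.get_pipeline(
        task=task,
        model_dir=model_dir,
        device=device_id,
        return_all_scores=True,
        batch_size=25,  # batch the inputs of a request into one forward pass instead of 25
    )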