When I hit the hosted Inference API for the facebook/blenderbot-400M-distill model, I get a response within 3 seconds. However, when I deploy the same model on AWS SageMaker, my prediction times are around 10 seconds, irrespective of whether I use a compute-optimized or a memory-optimized instance.
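This is roughly how I'm timing the hosted Inference API call (a minimal sketch: the token is a placeholder and I'm assuming the standard api-inference.huggingface.co endpoint):

import time
import requests

# hosted Inference API endpoint for the model (assumed standard URL)
API_URL = "https://api-inference.huggingface.co/models/facebook/blenderbot-400M-distill"
headers = {"Authorization": "Bearer <HF_API_TOKEN>"}  # placeholder token

payload = {
    "inputs": {
        "past_user_inputs": ["Which movie is the best ?"],
        "generated_responses": ["It's Die Hard for sure."],
        "text": "Can you explain why ?"
    }
}

start = time.perf_counter()
response = requests.post(API_URL, headers=headers, json=payload)
print(f"Inference API latency: {time.perf_counter() - start:.2f} s")
print(response.json())

And here is how I'm deploying the model on SageMaker: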
from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()

# Hub model configuration: https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'facebook/blenderbot-400M-distill',
    'HF_TASK': 'conversational'
}

# create Hugging Face Model class
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,   # number of instances
    instance_type='ml.m4.4xlarge'  # EC2 instance type
)

predictor.predict({
    'inputs': {
        "past_user_inputs": ["Which movie is the best ?"],
        "generated_responses": ["It's Die Hard for sure."],
        "text": "Can you explain why ?"
    }
})
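The ~10 second figure comes from timing this same predict call (just a rough measurement sketch, nothing fancy):

import time

payload = {
    "inputs": {
        "past_user_inputs": ["Which movie is the best ?"],
        "generated_responses": ["It's Die Hard for sure."],
        "text": "Can you explain why ?"
    }
}

start = time.perf_counter()
predictor.predict(payload)
print(f"SageMaker endpoint latency: {time.perf_counter() - start:.2f} s")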
I have even tried this with ml.c5.4xlarge and ml.m6g.4xlarge instances. The prediction time doesn't improve.
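For each instance type I tear down the old endpoint and redeploy, along these lines (a sketch; in practice I recreate the model object for each run, and only instance_type changes):

# tear down the previous endpoint before redeploying
predictor.delete_endpoint()

# same HuggingFaceModel definition as above, only the instance type differs
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.4xlarge'  # also repeated with 'ml.m6g.4xlarge'
)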
Any suggestions would be really appreciated!
Reference: Amazon SageMaker Pricing (https://aws.amazon.com/sagemaker/pricing/)