Deploying Hugging Face SageMaker Models with Elastic Inference

Hello @philschmid and Huggingface team,

We’d like to deploy a RoBERTa question answering model to SageMaker on an Inferentia instance. However, it seems that when we compile the model with torch.neuron.trace, the model input size needs to be fixed. This presents a problem for question answering models, where the context does not have a fixed size. In fact, Hugging Face’s question answering pipeline handles long contexts by breaking them up into overlapping chunks.
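For context, the compile step we have in mind looks roughly like the sketch below (the checkpoint name and max_length are just placeholders; compilation assumes an environment with the AWS Neuron SDK installed). The traced model is locked to the dummy input's shape, which is the limitation we are asking about:

```python
import torch
import torch.neuron  # AWS Neuron SDK extension for PyTorch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_id = "deepset/roberta-base-squad2"  # example QA checkpoint; ours may differ
max_length = 384                          # the traced model is fixed to this length

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id, torchscript=True)

# torch.neuron.trace records the graph for one fixed input shape,
# so we pad a dummy (question, context) pair to max_length.
dummy = tokenizer(
    "What is Inferentia?",
    "AWS Inferentia is a machine learning inference chip.",
    max_length=max_length,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
neuron_inputs = tuple(dummy.values())

# Every later request must be padded/truncated to this same shape.
model_neuron = torch.neuron.trace(model, neuron_inputs)
model_neuron.save("model_neuron.pt")
```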

How would you suggest we solve this problem? And will the Hugging Face model hub eventually provide the capability to deploy a QA model to Inferentia directly?

Thank you for your help!

Hello @jxiao,

You can check out this blog post on how to compile and deploy models to Inferentia: Accelerate BERT inference with Hugging Face Transformers and AWS Inferentia
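On the inference side, a minimal sketch of how a fixed-shape traced model can still serve variable-length questions and contexts is shown below: every request is padded or truncated to the compiled max_length. Contexts longer than that are simply cut off here; reproducing the overlapping-chunk behaviour of the transformers QA pipeline would mean running several fixed-size passes on top of this. The checkpoint name, file path, and max_length are assumptions for illustration:

```python
import torch
import torch.neuron  # needed so the Neuron ops are registered before loading
from transformers import AutoTokenizer

max_length = 384  # must match the length the model was traced with
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")  # example checkpoint
model_neuron = torch.jit.load("model_neuron.pt")  # traced model saved at compile time

def answer(question: str, context: str) -> str:
    # Pad/truncate every request to the compiled shape.
    inputs = tokenizer(
        question,
        context,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    # With torchscript=True the traced model returns (start_logits, end_logits).
    start_logits, end_logits = model_neuron(*tuple(inputs.values()))
    start = int(torch.argmax(start_logits))
    end = int(torch.argmax(end_logits)) + 1
    return tokenizer.decode(inputs["input_ids"][0][start:end])

print(answer("What is Inferentia?", "AWS Inferentia is a machine learning inference chip."))
```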
