Deploying Hugging Face SageMaker Models with Elastic Inference

Hello @philschmid and Huggingface team,

We’d like to deploy a RoBERTa question answering model to SageMaker on an Inferentia instance. However, it seems that when we compile the model with torch.neuron.trace, the model input size needs to be fixed. This presents a problem for question answering models, where the context does not have a fixed size. In fact, Hugging Face’s question answering pipeline handles long contexts by breaking them up into overlapping chunks.
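For context, the compile step we have in mind looks roughly like the sketch below (the checkpoint name and max_length are just placeholders; compilation assumes an environment with the AWS Neuron SDK installed). The traced model is locked to the dummy input's shape, which is the limitation we are asking about:

```python
import torch
import torch.neuron  # AWS Neuron SDK extension for PyTorch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_id = "deepset/roberta-base-squad2"  # example QA checkpoint; ours may differ
max_length = 384                          # the traced model is fixed to this length

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id, torchscript=True)

# torch.neuron.trace records the graph for one fixed input shape,
# so we pad a dummy (question, context) pair to max_length.
dummy = tokenizer(
    "What is Inferentia?",
    "AWS Inferentia is a machine learning inference chip.",
    max_length=max_length,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
neuron_inputs = tuple(dummy.values())

# Every later request must be padded/truncated to this same shape.
model_neuron = torch.neuron.trace(model, neuron_inputs)
model_neuron.save("model_neuron.pt")
```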

How would you suggest we solve this problem? And will the Hugging Face model hub eventually provide the capability to deploy a QA model to Inferentia directly?

Thank you for your help!

Hello @jxiao,

You can check out this blog post on how to compile and deploy models to Inferentia: Accelerate BERT inference with Hugging Face Transformers and AWS Inferentia
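On the inference side, a minimal sketch of how a fixed-shape traced model can still serve variable-length questions and contexts is shown below: every request is padded or truncated to the compiled max_length. Contexts longer than that are simply cut off here; reproducing the overlapping-chunk behaviour of the transformers QA pipeline would mean running several fixed-size passes on top of this. The checkpoint name, file path, and max_length are assumptions for illustration:

```python
import torch
import torch.neuron  # needed so the Neuron ops are registered before loading
from transformers import AutoTokenizer

max_length = 384  # must match the length the model was traced with
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")  # example checkpoint
model_neuron = torch.jit.load("model_neuron.pt")  # traced model saved at compile time

def answer(question: str, context: str) -> str:
    # Pad/truncate every request to the compiled shape.
    inputs = tokenizer(
        question,
        context,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    # With torchscript=True the traced model returns (start_logits, end_logits).
    start_logits, end_logits = model_neuron(*tuple(inputs.values()))
    start = int(torch.argmax(start_logits))
    end = int(torch.argmax(end_logits)) + 1
    return tokenizer.decode(inputs["input_ids"][0][start:end])

print(answer("What is Inferentia?", "AWS Inferentia is a machine learning inference chip."))
```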
