How to deploy a T5 model to AWS SageMaker for fast inference?


I just watched the video of the Workshop: Going Production: Deploying, Scaling & Monitoring Hugging Face Transformer models (11/02/2021) from Hugging Face.

Following the deployment instructions (starting at 28:14), I created a notebook instance (type: ml.m5.xlarge) on AWS SageMaker and uploaded the notebook lab3_autoscaling.ipynb from huggingface-sagemaker-workshop-series >> workshop_2_going_production on GitHub.

I ran it and got an inference time of about 70ms for the QA model (distilbert-base-uncased-distilled-squad). Great!

Then, I changed the model to be loaded from the HF model hub to t5-base with the following code:

hub = {
  'HF_MODEL_ID':'t5-base', # model_id from
  'HF_TASK':'translation' # NLP task you want to use for predictions
}

I deployed it with the following code:

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(

Then I ran an inference… but the inference time went up to more than 700ms!

In the video (starting at 57:05), @philschmid said that some models still cannot be deployed this way. I would like to know whether T5 models (up to ByT5) are optimized for inference in AWS SageMaker (for example, through ONNX quantization).

If they are not yet optimized (as it looks like), when will they be?

Note: I noticed the same problem with T5 inference through the Inference API (see this thread: How to get Accelerated Inference API for T5 models?).

For large DL models such as transformers, inference on CPU is slower than on GPU, and T5 is much bigger than the distilbert used in the demo. 700ms is actually not that bad for a CPU transformer. Try replacing m5.xlarge with g4dn.xlarge to reduce latency.

Hello @OlivierCR.

You are right about GPU vs. CPU inference time, but I’m running my tests with the same configuration for the 2 models (distilbert-base-uncased and T5 base).

As for model size, we are not talking about large DL models here.

  • distilbert-base-uncased: 66 million parameters (source) / Inference time: 70ms
  • T5 base: 220 million parameters (source) / Inference time: 700ms

There are 4 times more parameters in T5 base than distilbert-base-uncased, but its inference time is 10 times slower on the same instance (type: ml.m5.xlarge) of AWS SageMaker.

Clearly, I could use a better instance, which would improve both inference times, but that would not explain why inference is so slow for a Seq2Seq model such as T5 base in AWS SageMaker.

I think that T5 base is not optimized in AWS SageMaker the way the BERT models are (through ONNX, for example), but I guess only the HF team can confirm this.

Hey @pierreguillou,

Thanks for opening the thread and I am happy to hear the workshop material was enough to get you started!

Currently, the models aren’t optimized automatically. If you would like to run optimized models, you need to optimize them yourself and then provide them.

Regarding your speed assumption.

There are 4 times more parameters in T5 base than distilbert-base-uncased, but its inference time is 10 times slower on the same instance (type: ml.m5.xlarge) of AWS SageMaker.

That’s because the two models have different architectures, were trained on different tasks, and use different methods for inference. For example, T5 uses the .generate method with beam search to create your translation, which means it does not run a single forward pass through the model; there can be many.
So the latency difference between distilbert and T5 makes sense and is not related to SageMaker.
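
To make the point about multiple forward passes concrete, here is a toy sketch. It is not the transformers implementation of .generate(): CountingDecoder and greedy_generate are hypothetical names, and the "decoder" simply replays a fixed sequence. It only illustrates that autoregressive decoding calls the decoder once per generated token, even without beam search, whereas an encoder-only QA model such as distilbert needs a single forward pass per request.

```python
# Toy illustration only: a stand-in "decoder" that counts how often it is called.
# It is NOT the transformers implementation of generate().

EOS = "</s>"


class CountingDecoder:
    """Hypothetical decoder that emits a fixed target sequence, one token per call."""

    def __init__(self, target_tokens):
        self.target_tokens = list(target_tokens) + [EOS]
        self.forward_passes = 0

    def forward(self, generated_so_far):
        # One call stands for one decoder forward pass.
        self.forward_passes += 1
        return self.target_tokens[len(generated_so_far)]


def greedy_generate(decoder, max_length=32):
    """Greedy decoding (no beam search): one forward pass per generated token."""
    generated = []
    while len(generated) < max_length:
        token = decoder.forward(generated)
        if token == EOS:
            break
        generated.append(token)
    return generated


decoder = CountingDecoder(["Das", "Haus", "ist", "wunderbar", "."])
tokens = greedy_generate(decoder)
print(tokens, decoder.forward_passes)
```

In this sketch, five output tokens cost six decoder calls (the last one produces the end-of-sequence token). With beam search, each step would additionally score several candidate sequences.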

Hello @philschmid.

Thanks for your answer.

I’m not sure I understand. Clearly, a T5 model uses the .generate() method with a beam search to create a translation. However, the default value of num_beams is 1, which means no beam search, as written in the HF documentation of the .generate() method:

num_beams (int, optional, defaults to 1) – Number of beams for beam search. 1 means no beam search.

Therefore, by default in AWS SageMaker, there is only one forward pass through the T5 (base) model at each inference when predictor.predict(data) is called, no? And if you confirm this point, it means that the distilbert model in the AWS SageMaker DLC is optimized and the T5 model is not. What do you think?

Note: by the way, what would be the code in AWS SageMaker to increase the beam search argument in .generate()?

Well, I have a question: when will HF optimize Seq2Seq models like T5 in the AWS SageMaker DLC?

Let’s say it will be only next year. That means I need to do it myself today. Could you first validate the following steps?

  1. Fine-tune a T5 base model on a downstream task, either in AWS SageMaker or in another environment (GCP, local GPU, etc.)
  2. Convert the fine-tuned T5 base model to ONNX format (with fastT5 for example)
  3. Upload the ONNX T5 base model to S3 in AWS
  4. Use the ONNX T5 base model in AWS SageMaker DLC in order to make inferences

The last question is: where can I find the code for steps 3 and 4?
Thanks for your help.

  3. Upload the ONNX T5 base model to S3 in AWS

You can use, for example, boto3 or the CLI to upload files to S3, and you can find documentation on how to create the model.tar.gz here: Deploy models to Amazon SageMaker
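
As a rough sketch of this step, the following packs a local model directory into a model.tar.gz (files at the archive root, with an optional code/ subfolder) and uploads it with boto3. The helper names make_model_archive and upload_archive, and the bucket and key in the usage note, are assumptions made for illustration; only tarfile and the S3 client's upload_file call are real APIs here.

```python
import tarfile
from pathlib import Path


def make_model_archive(model_dir, archive_path="model.tar.gz"):
    """Pack the contents of model_dir at the root of a gzipped tar archive.

    Write the archive outside model_dir so it does not get packed into itself.
    """
    model_dir = Path(model_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(model_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to model_dir, e.g. "config.json" or
                # "code/requirements.txt".
                tar.add(path, arcname=path.relative_to(model_dir).as_posix())
    return Path(archive_path)


def upload_archive(archive_path, bucket, key, s3_client=None):
    """Upload the archive to s3://bucket/key and return that URI."""
    if s3_client is None:
        import boto3  # imported lazily so packing works without AWS access

        s3_client = boto3.client("s3")
    s3_client.upload_file(str(archive_path), bucket, key)
    return f"s3://{bucket}/{key}"
```

A possible usage, with placeholder names: archive = make_model_archive("my_t5_model") followed by upload_archive(archive, "my-bucket", "t5/model.tar.gz"). The returned S3 URI is what would then be passed as model_data when creating the HuggingFaceModel.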

  4. Use the ONNX T5 base model in AWS SageMaker DLC in order to make inferences

Currently, there is no example for using ONNX in SageMaker with the HF DLC, but you would need to create a custom inference script as documented here: Deploy models to Amazon SageMaker, add the ONNX dependencies in a requirements.txt, package everything, and upload it to S3.
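
A minimal skeleton of such a custom script might look like the following. It only shows the model_fn and predict_fn override hooks of the Hugging Face inference toolkit; the loading code is an assumption and uses a regular PyTorch pipeline as a placeholder, not ONNX. For an ONNX export, the body of model_fn would have to be replaced with your own loading code, and the extra packages listed in code/requirements.txt.

```python
# code/inference.py (sketch; the loading step is a placeholder, not ONNX)


def model_fn(model_dir):
    """Load the model once when the endpoint container starts."""
    # Assumption: model_dir holds a regular PyTorch checkpoint. For an ONNX
    # export, replace this body with your own loading code.
    from transformers import pipeline  # lazy import: only needed in the container

    return pipeline("text2text-generation", model=model_dir)


def predict_fn(data, model):
    """Run one prediction; `model` is whatever model_fn returned."""
    inputs = data["inputs"]
    # Generation arguments (num_beams, max_length, ...) arrive under "parameters".
    parameters = data.get("parameters") or {}
    return model(inputs, **parameters)
```

Keeping predict_fn independent of how the model was loaded means the same request format keeps working if the placeholder pipeline is later swapped for an ONNX-backed callable.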

Hi @philschmid.

I’m back to you about using AWS SageMaker for inference with a Text2Text-Generation model like T5.

My objective is to use an ONNX T5 model for inference but in order to understand the logic behind the SageMaker Hugging Face Inference Toolkit, I started with a T5 model from the HF hub.

To do that, I’m using your notebook deploy_transformer_model_from_hf_hub.ipynb.

It worked but I was surprised to get a different predicted text than the one I get when I use the model in a notebook.

As I understand it, the HF deployment code in AWS SageMaker uses pipeline(), so my hypothesis is that arguments like num_beams and max_length have default values that I need to change.

So my question is: how can I change the values of these arguments in an AWS SageMaker deployment? Thanks.
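
For reference, the request format used with the Hugging Face inference toolkit carries the text under "inputs" and generation arguments under "parameters". The sketch below only builds that payload; build_request is a hypothetical helper name, and the argument values are the ones quoted later in this thread.

```python
import json


def build_request(text, **generation_kwargs):
    """Build a request payload with generation arguments under "parameters"."""
    payload = {"inputs": text}
    if generation_kwargs:
        payload["parameters"] = dict(generation_kwargs)
    return payload


data = build_request("xxxx", max_length=32, num_beams=1, early_stopping=True)
print(json.dumps(data, indent=2))
# With a deployed endpoint, this would be sent as: predictor.predict(data)
```

Whether a given argument is honored depends on the pipeline behind the endpoint, so this is a starting point rather than a guarantee of identical outputs.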

@philschmid: very strange. I think I found how to pass parameters, but when I pass the same parameters as the ones I used in a Colab notebook, I get 2 different predictions…

Code from my Colab notebook

model_name = "xxx"
API_TOKEN = 'xxxx' # API token 
max_target_length = 32 
num_beams = 1

text2text = pipeline(

# put a prefix before the text
input_text = "xxxxx" # one sentence

# get prediction
pred = text2text(input_text)[0]['generated_text']

# print result
print('input_text |',input_text)
print('prediction |',pred)

Code I use in the AWS SageMaker Deploy notebook

input_text = "xxxx"

data= {
    "parameters": {

# request

@pierreguillou when using generative models, it is not guaranteed that the output is always exactly the same, especially when converting the model to an ONNX model.

What different output are you seeing? Are you using the same transformers version?

Hi @philschmid. I have difficulty understanding your sentence. When I have a model (generative or not), the same input text, and of course the same values for the arguments of the generate() method (num_beams, etc.), I do not understand how the output (i.e., the calculation by the model) could differ.

I just published a simple Colab notebook (generate_method_T5.ipynb) and ran the generate() method 1000 times with the same input: the output is always the same (PyTorch and pipeline()).

Here, I understand. If you convert any model to ONNX format, you slightly change the values of its parameters, which can produce a different output than the corresponding PyTorch model (but again, always the same output for the same input).

As a proof of concept, in the same Colab notebook (generate_method_T5.ipynb), I used the library fastt5 in order to get an ONNX model from the T5 one.

From the question “When is the birthday of Pierre?”:

  • the PyTorch and pipeline() models give the answer “17 February”,
  • and the ONNX model gives “30 years, 160 days”.

I’m fine with that (at least, I understand it).

Yes. I’m using transformers 4.15 (I tested fastt5 with a version > 4.16, but I got an error when using the generate() method).

Last point: in AWS SageMaker Inference, I did not use the ONNX model but the PyTorch T5 model. Since I saw a different output, does that mean the inference DLC applies some compression to T5 (ONNX or similar) that could explain the difference?

No, SageMaker is not doing compression of any sort today.

Are you installing transformers==4.15 through a requirements.txt within SageMaker?
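
One way to check this is to print the installed library versions in both environments (for example, the Colab notebook and the SageMaker container) and compare them. The sketch below uses only the Python standard library; installed_versions is a hypothetical helper name and the default package list is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("transformers", "torch", "sentencepiece")):
    """Return {package: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions())
```

Differences in the reported versions between the two environments would be a first thing to rule out before looking at generation arguments.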

Thanks @philschmid for this information about T5 in SageMaker Inference (no compression so far).

I used the translation script (I ran it locally in an AWS SageMaker notebook instance because I made some changes to it). It has a requirements.txt (see my modified content below), but this file did not install transformers==4.15:

# content of my modified requirements.txt file
datasets >= 1.16.0
sentencepiece != 0.1.92
sacrebleu >= 1.4.12
torch >= 1.3

Then I trained my T5 model on the AWS SageMaker Training DLC with the library versions listed in Reference >> Training DLC Overview. As shown in the following code from my notebook, I used transformers==4.12.3 and PyTorch 1.9.1:

# 2.72.1

huggingface_estimator = HuggingFace(
      hyperparameters = hyperparameters,

Then, I uploaded my T5 model to HF model hub in private mode.

Finally, I used AWS SageMaker Inference with the same library versions in the following code:

from sagemaker.huggingface import HuggingFaceModel
import sagemaker 

role = sagemaker.get_execution_role()

hub = {
  'HF_MODEL_ID':'xxxxxxx', # model_id from
  'HF_API_TOKEN':"xxxxxxxx" # my API token
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   role=role, # iam role with permissions to create an Endpoint
   transformers_version="4.12.3", # transformers version used
   pytorch_version="1.9.1", # pytorch version used
   py_version="py38", # python version of the DLC

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(

input_text = "xxxx"

data= {
    "parameters": {
        "max_length":32, # same value as the one used for training
        "num_beams":1, # same value as the one used for training
        "early_stopping":True # same value as the one used for training

# request

However, as I said in my post, the predictions from predictor.predict(data) are different from the ones I get in a Colab notebook with the same PyTorch model and the same arguments (num_beams, …).

What do you think? Thank you for your help.

Hi @pierreguillou, I have the same problem. I created a post here. Did you solve the problem?

Hi @Gennaro

No. I’m still waiting for an answer from Hugging Face (i.e., @philschmid) about my post.