How to deploy a T5 model to AWS SageMaker for fast inference?


I just watched the video of the Workshop: Going Production: Deploying, Scaling & Monitoring Hugging Face Transformer models (11/02/2021) from Hugging Face.

With the information about how to deploy (timeline start: 28:14), I created a notebook instance (type: ml.m5.xlarge) on AWS SageMaker, where I uploaded the notebook lab3_autoscaling.ipynb from huggingface-sagemaker-workshop-series >> workshop_2_going_production on GitHub.

I ran it and got an inference time of about 70ms for the QA model (distilbert-base-uncased-distilled-squad). Great!

Then, I changed the model to be loaded from the HF model hub to t5-base with the following code:

hub = {
  'HF_MODEL_ID':'t5-base', # model_id from the Hugging Face model hub
  'HF_TASK':'translation'  # NLP task you want to use for predictions
}
I deployed the model with the following code:

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
  initial_instance_count=1, instance_type='ml.m5.xlarge'
)
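The inference request is then a plain JSON payload; a minimal sketch (note the T5 task prefix in the input text; the predict call needs the deployed endpoint, so it is commented out):

```python
# T5 expects a task prefix in the input for translation.
data = {"inputs": "translate English to German: How old are you?"}

# With the endpoint deployed above (requires AWS credentials):
# print(predictor.predict(data))
print(data["inputs"].split(":")[0])  # translate English to German
```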

And then, I launched an inference… but the inference time went up to more than 700ms!

In the video (timeline start: 57:05), @philschmid said that there are still models that cannot be deployed this way. I would therefore like to check whether T5 models (up to ByT5) are optimized for inference in AWS SageMaker or not (through ONNX quantization, for example).

If they are not yet optimized (as it looks like), when will they be?

Note: I noticed the same problem with T5 inference through the Inference API (see this thread: How to get Accelerated Inference API for T5 models?).

For large DL models such as transformers, inference on CPU is slower than on GPU, and T5 is much bigger than the DistilBERT used in the demo. 700ms is actually not that bad for a CPU transformer :slight_smile: Try replacing ml.m5.xlarge with ml.g4dn.xlarge to reduce latency.

Hello @OlivierCR.

You are right about GPU vs CPU inference time, but I’m running my tests with the same configuration for both models (distilbert-base-uncased and T5 base).

Regarding model size, we are not talking about large DL models here:

  • distilbert-base-uncased: 66 million parameters (source) / inference time: 70ms
  • T5 base: 220 million parameters (source) / inference time: 700ms

There are 4 times more parameters in T5 base than in distilbert-base-uncased, but its inference is 10 times slower on the same AWS SageMaker instance (type: ml.m5.xlarge).

Clearly, I can use a better instance and it will improve both inference times, but that does not explain why inference is so slow for a Seq2Seq model like T5 base in AWS SageMaker.

I think that T5 base is not optimized in AWS SageMaker the way the BERT models are (through ONNX, for example), but only the HF team can confirm that, I guess.

Hey @pierreguillou,

Thanks for opening the thread and I am happy to hear the workshop material was enough to get you started!

Currently, the models aren’t optimized automatically. If you would like to run optimized models, you currently need to optimize them yourself and then provide them.

Regarding your speed assumption:

There are 4 times more parameters in T5 base than distilbert-base-uncased, but its inference time is 10 times slower on the same instance (type: ml.m5.xlarge) of AWS SageMaker.

That’s because the two models have different architectures, were trained on different tasks, and use different methods for inference. For example, T5 uses the .generate() method with beam search to create your translation, which means it is not running just one forward pass through the model; there can be multiple.
So the latency difference between DistilBERT and T5 makes sense and is not related to SageMaker.
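To make that concrete, here is a toy sketch (pure Python, with a stand-in `forward` function; the token counts are illustrative) of why generation costs more than a single classification pass:

```python
# Autoregressive decoding (what .generate() does under the hood) runs one
# decoder forward pass per generated token -- even with num_beams=1.
calls = {"n": 0}

def forward(tokens):
    """Stand-in for one model forward pass; returns a dummy next-token id."""
    calls["n"] += 1
    return len(tokens)

def greedy_generate(prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tokens.append(forward(tokens))  # one forward pass per new token
    return tokens

# Classification (DistilBERT-style): a single forward pass.
forward([1, 2, 3])
single_pass = calls["n"]

# Translation (T5-style): one pass per output token; beam search multiplies this.
calls["n"] = 0
greedy_generate([1, 2, 3], max_new_tokens=20)
print(single_pass, calls["n"])  # 1 20
```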

Hello @philschmid.

Thanks for your answer.

I’m not sure I understand. Clearly, a T5 model uses the .generate() method with beam search to create a translation. However, the default value of num_beams is 1, which means no beam search, as written in the HF doc of the .generate() method:

**num_beams** ( int , optional, defaults to 1) – Number of beams for beam search. 1 means no beam search.

Therefore, by default in AWS SageMaker, there is only one forward pass through the T5 (base) model at each inference when predictor.predict(data) is launched, no? And if you confirm this point, it means that the distilbert model in the AWS SageMaker DLC is optimized, and the T5 model is not. What do you think?

Note: by the way, what would be the code in AWS SageMaker to increase the num_beams argument of .generate()?
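For reference, the Hugging Face inference toolkit accepts generation arguments under a `parameters` key in the request payload and forwards them to `.generate()`; a sketch (the values are illustrative, and the predict call needs a live endpoint, so it is commented out):

```python
# Generation arguments go under "parameters" and are forwarded to .generate().
payload = {
    "inputs": "translate English to French: The book is on the table.",
    "parameters": {
        "num_beams": 4,    # enable beam search (default 1 = greedy decoding)
        "max_length": 64,
    },
}

# With a deployed endpoint (requires AWS credentials):
# result = predictor.predict(payload)
print(sorted(payload["parameters"]))  # ['max_length', 'num_beams']
```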

Well, I have a question: when will HF optimize Seq2Seq models like T5 in the AWS SageMaker DLC?

Let’s say it will only be next year. That means I need to do it myself today. Could you first validate the following steps?

  1. Fine-tune a T5 base model on a downstream task, either in AWS SageMaker or in another environment (GCP, local GPU, etc.)
  2. Export the fine-tuned T5 base model to ONNX format (with fastT5, for example)
  3. Upload the ONNX T5 base model to S3 on AWS
  4. Use the ONNX T5 base model in the AWS SageMaker DLC in order to make inferences
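Steps 3 and 4 above could be sketched as follows, under assumptions: the file names and bucket are hypothetical; the S3 upload and deployment calls require AWS credentials, so they are shown as comments; and serving an ONNX model also needs a custom code/inference.py (model_fn/predict_fn), which is only outlined here:

```python
import pathlib
import tarfile
import tempfile

# Step 3 (first half): package the exported ONNX model the way SageMaker
# expects -- a model.tar.gz with the model files at the archive root.
model_dir = pathlib.Path(tempfile.mkdtemp())
(model_dir / "t5-base-encoder.onnx").write_bytes(b"...")  # placeholder files;
(model_dir / "t5-base-decoder.onnx").write_bytes(b"...")  # fastT5 exports several
(model_dir / "tokenizer.json").write_bytes(b"...")

archive = model_dir / "model.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    for f in sorted(model_dir.glob("*.onnx")):
        tar.add(f, arcname=f.name)
    tar.add(model_dir / "tokenizer.json", arcname="tokenizer.json")
print(archive.exists())  # True

# Step 3 (second half): upload to S3 (requires AWS credentials):
# from sagemaker.s3 import S3Uploader
# model_uri = S3Uploader.upload(str(archive), "s3://<your-bucket>/t5-onnx")

# Step 4: deploy with a custom inference script that loads the ONNX model
# (model_fn / predict_fn in code/inference.py):
# from sagemaker.huggingface import HuggingFaceModel
# huggingface_model = HuggingFaceModel(
#     model_data=model_uri, role=role,
#     entry_point="inference.py", source_dir="code",
#     transformers_version="4.6", pytorch_version="1.7", py_version="py36",
# )
# predictor = huggingface_model.deploy(
#     initial_instance_count=1, instance_type="ml.m5.xlarge")
```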

The last question is: where can I find the code for steps 3 and 4?
Thanks for your help.