How to deploy a T5 model to AWS SageMaker for fast inference?

Hello @OlivierCR.

You are right about GPU vs CPU inference time, but I am running my tests with the same configuration for both models (distilbert-base-uncased and T5 base).
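For reference, here is a minimal sketch of the kind of timing comparison I am doing, assuming plain `transformers` pipelines on CPU (the input text and the warm-up step are illustrative, not the exact SageMaker test):

```python
# Minimal timing sketch: load both models the same way and time one call each.
# Absolute numbers will vary by machine; only the ratio matters here.
import time
from transformers import pipeline

# The classification head of distilbert-base-uncased is untrained here;
# that is fine because we only care about forward-pass latency.
clf = pipeline("text-classification", model="distilbert-base-uncased")
gen = pipeline("text2text-generation", model="t5-base")

text = "translate English to French: The house is wonderful."

# Warm up once so first-call overhead is not measured.
clf(text)
gen(text)

start = time.perf_counter()
clf(text)
print(f"distilbert-base-uncased: {(time.perf_counter() - start) * 1000:.0f} ms")

start = time.perf_counter()
gen(text)
print(f"t5-base: {(time.perf_counter() - start) * 1000:.0f} ms")
```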

Regarding model size, we are not talking about large DL models here.

  • distilbert-base-uncased: 66 million parameters (source) / inference time: 70 ms
  • T5 base: 220 million parameters (source) / inference time: 700 ms

T5 base has roughly 3.3 times as many parameters as distilbert-base-uncased, yet its inference time is 10 times longer on the same AWS SageMaker instance (type: ml.m5.xlarge).
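For context, this is roughly how I deploy each model: a sketch assuming the SageMaker Hugging Face inference toolkit, where the IAM role ARN is a placeholder and the container versions are examples to adapt to what your region supports.

```python
# Deployment sketch using the SageMaker Hugging Face inference toolkit.
# Swap HF_MODEL_ID / HF_TASK to deploy distilbert-base-uncased instead.
from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    env={
        "HF_MODEL_ID": "t5-base",            # or "distilbert-base-uncased"
        "HF_TASK": "text2text-generation",   # "text-classification" for DistilBERT
    },
    role="arn:aws:iam::<account>:role/<sagemaker-role>",  # placeholder
    transformers_version="4.26",  # example versions; use a supported combination
    pytorch_version="1.13",
    py_version="py39",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",  # same CPU instance for both models
)

print(predictor.predict({"inputs": "translate English to French: Hello world."}))
```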

Clearly, I could use a more powerful instance and both inference times would improve, but that would not explain why inference is so slow for a Seq2Seq model like T5 base on AWS SageMaker.
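One factor I want to rule out, besides instance size, is autoregressive decoding: T5 runs its decoder once per generated token, while distilbert-base-uncased needs only a single forward pass. A quick hedged check, assuming a recent `transformers` version that supports `max_new_tokens`/`min_new_tokens`, is to time generation at different output lengths:

```python
# Sketch to check how much latency comes from autoregressive decoding:
# if latency grows roughly linearly with the number of generated tokens,
# decoding length, not a missing optimization, dominates the gap.
import time
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
inputs = tokenizer("translate English to French: The house is wonderful.",
                   return_tensors="pt")

for n_tokens in (1, 8, 32, 64):
    start = time.perf_counter()
    # Force exactly n_tokens generated tokens so runs are comparable.
    model.generate(**inputs, max_new_tokens=n_tokens, min_new_tokens=n_tokens)
    elapsed = (time.perf_counter() - start) * 1000
    print(f"{n_tokens:>3} generated tokens: {elapsed:.0f} ms")
```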

I think that T5 base is not optimized in AWS SageMaker the way the BERT models are (through ONNX, for example), but only the HF team can confirm that, I guess.
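In the meantime, one way I plan to test the ONNX hypothesis myself is with Hugging Face Optimum. A sketch, assuming a recent `optimum` version where `export=True` triggers the ONNX export (older versions used `from_transformers=True`):

```python
# Export t5-base to ONNX Runtime via Optimum and run generation with it;
# timing this against the PyTorch version would show how much ONNX helps.
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")
ort_model = ORTModelForSeq2SeqLM.from_pretrained("t5-base", export=True)

inputs = tokenizer("translate English to French: Hello world.",
                   return_tensors="pt")
outputs = ort_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```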