Hello @OlivierCR.
You are right about GPU vs CPU inference time, but I’m running the tests with the same configuration for the two models (distilbert-base-uncased and T5 base).
About model sizes, we are not talking about large DL models here.
- distilbert-base-uncased: 66 million parameters (source) / inference time: 70 ms
- T5 base: 220 million parameters (source) / inference time: 700 ms
T5 base has about 3.3 times as many parameters as distilbert-base-uncased, but its inference time is 10 times longer on the same AWS SageMaker instance (type: ml.m5.xlarge).
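
For reference, here is a minimal sketch of how such a timing comparison can be reproduced locally on CPU with `transformers` (the task prefix, the number of runs, and `max_new_tokens=20` are my placeholder choices, not the exact SageMaker setup):

```python
import time

import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

text = "This is a short test sentence for timing inference."

# Encoder-only model: one forward pass per input
# (the classification head is randomly initialized here, which is fine for timing)
db_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
db_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased").eval()
db_inputs = db_tok(text, return_tensors="pt")

# Seq2Seq model: generate() runs the decoder once per generated token
t5_tok = AutoTokenizer.from_pretrained("t5-base")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("t5-base").eval()
t5_inputs = t5_tok("translate English to French: " + text, return_tensors="pt")

def avg_ms(fn, runs=20):
    fn()  # warm-up run, excluded from the average
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs * 1000

with torch.no_grad():
    print(f"distilbert forward: {avg_ms(lambda: db_model(**db_inputs)):.0f} ms")
    print(f"t5-base generate:   {avg_ms(lambda: t5_model.generate(**t5_inputs, max_new_tokens=20)):.0f} ms")
```

Note that `generate()` loops the decoder once per output token, so the measured time for T5 depends on the output length as well as the parameter count.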
Clearly, I could use a better instance and both inference times would improve, but that still would not explain why the inference time of a Seq2Seq model like T5 base is so high in AWS SageMaker.
I think T5 base is not optimized in AWS SageMaker the way the BERT models are (through ONNX, for example), but I guess only the HF team can confirm that.
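
To illustrate what I mean by ONNX optimization, here is a sketch with Hugging Face Optimum (assuming `pip install optimum[onnxruntime]`; in recent versions of `optimum`, `export=True` converts the checkpoint on the fly, and `ORTModelForSeq2SeqLM` would be the counterpart to try for T5):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch weights to ONNX and runs them with ONNX Runtime
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

inputs = tokenizer("Testing ONNX Runtime inference.", return_tensors="pt")
logits = ort_model(**inputs).logits
print(logits.shape)
```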