How to deploy a T5 model to AWS SageMaker for fast inference?

For large DL models such as transformers, inference on CPU is slower than on GPU, and T5 is much bigger than the DistilBERT used in the demo. 700 ms is actually not that bad for a transformer on CPU :slight_smile: Try replacing the m5.xlarge instance with a g4dn.xlarge (GPU) to reduce latency.
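
Here is a minimal sketch of what that could look like with the `sagemaker` Python SDK's `HuggingFaceModel` class, assuming you deploy a T5 checkpoint straight from the Hub. The model ID, task, and container versions below are illustrative placeholders; pick whatever matches your model and the framework versions your SDK release supports. Note that SageMaker instance types carry an `ml.` prefix in the API:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# Assumes you run this inside SageMaker (Studio/notebook); otherwise pass an IAM role ARN.
role = sagemaker.get_execution_role()

# Placeholder Hub config: swap in your own T5 checkpoint and task.
hub = {
    "HF_MODEL_ID": "t5-base",
    "HF_TASK": "text2text-generation",
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.26",  # example versions — use ones your SDK supports
    pytorch_version="1.13",
    py_version="py39",
)

# Deploy on a GPU instance (ml.g4dn.xlarge) instead of a CPU one (ml.m5.xlarge).
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)

print(predictor.predict({"inputs": "translate English to German: Hello, how are you?"}))
```

Don't forget to call `predictor.delete_endpoint()` when you're done, since a GPU endpoint bills per hour while it's running.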