Optimizations and cloud instance characteristics for Flan-T5 real-time inference

Based on the great blog post: Deploy T5 11B for inference for less than $500

I have two questions related to deploying Flan-T5:

  1. Latency and RPS: would it be more efficient to use [mixed precision and sharding + LLM.int8()], or to [convert the model to ONNX + graph optimizations]? (Sketches of both options follow this list.)

  2. Is there a way to estimate, before load testing, what GPU characteristics (e.g. VRAM) would be needed to run Flan-T5 for real-time inference? (See the back-of-envelope estimate below.)
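
For context on option 1, here is a minimal sketch of the LLM.int8() path using the transformers + bitsandbytes integration (what the linked blog post builds on). The checkpoint name and generation settings are just illustrative:

```python
# Sketch: load Flan-T5-XXL with 8-bit weights (LLM.int8()) via
# transformers + bitsandbytes (pip install transformers accelerate bitsandbytes)
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xxl"  # illustrative checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="auto",   # shard layers across available GPUs (and CPU if needed)
    load_in_8bit=True,   # quantize weights with LLM.int8()
)

inputs = tokenizer("Translate to German: Hello, world.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```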
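
And a minimal sketch of the ONNX path, assuming the optimum library with the ONNX Runtime extras installed (pip install "optimum[onnxruntime-gpu]"); a smaller checkpoint is used here purely for illustration:

```python
# Sketch: export a Flan-T5 checkpoint to ONNX and run it with ONNX Runtime
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

model_name = "google/flan-t5-large"  # illustrative, smaller checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
# export=True converts the PyTorch checkpoint to ONNX on the fly
model = ORTModelForSeq2SeqLM.from_pretrained(model_name, export=True)

inputs = tokenizer("Summarize: ONNX Runtime can speed up inference.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```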
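
For question 2, a rough back-of-envelope estimate is possible before any load testing: for generation, the weights dominate GPU memory, so VRAM is roughly parameter count × bytes per parameter, plus some headroom for activations, the KV cache, and CUDA overhead. The 30% headroom factor below is an assumption, not a measured number:

```python
# Back-of-envelope VRAM estimate: weights dominate for inference, so
# bytes ~= n_params * bytes_per_param, with ~30% headroom assumed for
# activations, KV cache, and CUDA overhead.
def estimate_vram_gb(n_params: float, bytes_per_param: float, overhead: float = 1.3) -> float:
    return n_params * bytes_per_param * overhead / 1e9

for precision, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"Flan-T5-XXL (11B) in {precision}: ~{estimate_vram_gb(11e9, nbytes):.0f} GB")
```

This suggests roughly 57 GB in fp32, 29 GB in fp16/bf16, and 14 GB in int8 for the 11B checkpoint, which gives a first cut at which instance types are even candidates; actual latency and RPS still need load testing.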
