Based on the great blog post: Deploy T5 11B for inference for less than $500
I have two questions about deploying Flan-T5:
- Latency and RPS: would it be more efficient to use [mixed precision and sharding + LLM.int8()] or to [convert the model to ONNX + optimizations]? (See the first sketch after this list for the quantized path I have in mind.)
- Is there a way to estimate, before load testing, what GPU characteristics would be needed to run Flan-T5 for real-time inference? (A rough back-of-envelope calculation is sketched after this list.)
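
For context on the first question, here is a minimal sketch of the quantized path I am considering, using transformers + bitsandbytes; the model id, `device_map` setting, and the example prompt are just assumptions for illustration, not a settled setup:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/flan-t5-xxl"  # assuming the ~11B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    device_map="auto",   # shard layers across the available GPUs
    load_in_8bit=True,   # LLM.int8() quantization via bitsandbytes
)

# Example request, just to exercise the generate() path
inputs = tokenizer(
    "Translate to German: Hello, world!", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```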
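
For the second question, the only pre-load-test estimate I know of is a back-of-envelope calculation for the weights alone; the ~20% overhead factor below is my own assumption, and it ignores activations, KV/decoder buffers, and batching:

```python
def estimate_weight_vram_gb(
    n_params: float, bytes_per_param: float, overhead: float = 1.2
) -> float:
    """Approximate GPU memory (GB) needed just to hold the weights."""
    return n_params * bytes_per_param * overhead / 1e9

n_params = 11e9  # Flan-T5 XXL has ~11B parameters
for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{name}: ~{estimate_weight_vram_gb(n_params, bytes_per_param):.0f} GB")
# fp32: ~53 GB, fp16/bf16: ~26 GB, int8: ~13 GB
```

This only bounds the memory side; it says nothing about whether a given GPU meets a latency target, which I assume still requires load testing.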