Optimizations and cloud instance characteristics for Flan-T5 real-time inference

Based on the great blog post: Deploy T5 11B for inference for less than $500

I have two questions related to deploying Flan-T5:

  1. Latency and RPS: would it be more efficient to use [mixed precision and sharding + LLM.int8()], or to [convert the model to ONNX + graph optimizations]? (Sketches of both options follow this list.)

  2. Is there a way to estimate, before load testing, what GPU characteristics (e.g. VRAM) would be needed to run Flan-T5 for real-time inference? (See the back-of-envelope estimate below.)
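
For context on option 1, here is a minimal sketch of the LLM.int8() path using the transformers + bitsandbytes integration (what the linked blog post builds on). The checkpoint name and generation settings are just illustrative:

```python
# Sketch: load Flan-T5-XXL with 8-bit weights (LLM.int8()) via
# transformers + bitsandbytes (pip install transformers accelerate bitsandbytes)
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xxl"  # illustrative checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    device_map="auto",   # shard layers across available GPUs (and CPU if needed)
    load_in_8bit=True,   # quantize weights with LLM.int8()
)

inputs = tokenizer("Translate to German: Hello, world.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```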
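
And a minimal sketch of the ONNX path, assuming the optimum library with the ONNX Runtime extras installed (pip install "optimum[onnxruntime-gpu]"); a smaller checkpoint is used here purely for illustration:

```python
# Sketch: export a Flan-T5 checkpoint to ONNX and run it with ONNX Runtime
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

model_name = "google/flan-t5-large"  # illustrative, smaller checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
# export=True converts the PyTorch checkpoint to ONNX on the fly
model = ORTModelForSeq2SeqLM.from_pretrained(model_name, export=True)

inputs = tokenizer("Summarize: ONNX Runtime can speed up inference.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```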
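
For question 2, a rough back-of-envelope estimate is possible before any load testing: for generation, the weights dominate GPU memory, so VRAM is roughly parameter count × bytes per parameter, plus some headroom for activations, the KV cache, and CUDA overhead. The 30% headroom factor below is an assumption, not a measured number:

```python
# Back-of-envelope VRAM estimate: weights dominate for inference, so
# bytes ~= n_params * bytes_per_param, with ~30% headroom assumed for
# activations, KV cache, and CUDA overhead.
def estimate_vram_gb(n_params: float, bytes_per_param: float, overhead: float = 1.3) -> float:
    return n_params * bytes_per_param * overhead / 1e9

for precision, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"Flan-T5-XXL (11B) in {precision}: ~{estimate_vram_gb(11e9, nbytes):.0f} GB")
```

This suggests roughly 57 GB in fp32, 29 GB in fp16/bf16, and 14 GB in int8 for the 11B checkpoint, which gives a first cut at which instance types are even candidates; actual latency and RPS still need load testing.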
