I have been running inference on a 7B LLM using TGI on RunPod. Until now I have been loading the model directly, without quantisation.
Recently I tried quantising the model to 4-bit and 8-bit using TGI on the same GPU configuration. After quantisation, the model takes more inference time in both the 8-bit and 4-bit cases.
I was using bitsandbytes for quantisation.
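For reference, my launch command looks roughly like the sketch below (the model id, volume path, and port are placeholders, not my exact setup); for 4-bit I switch `--quantize bitsandbytes` to `--quantize bitsandbytes-nf4`:

```bash
# Rough sketch of the TGI launch; model id, volume, and port are placeholders.
model=some-org/some-7b-model    # placeholder for the 7B model I am serving
volume=$PWD/data                # placeholder cache directory mounted into the container

docker run --gpus all -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model \
    --quantize bitsandbytes     # 8-bit run; 4-bit run uses --quantize bitsandbytes-nf4
```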
Is this expected, or am I missing something? Shouldn't quantisation also reduce the model's inference time?