Should 8-bit quantization make inference faster on GPU?

I’m a bit confused too, but for the opposite reason. I thought inference with quantization was supposed to be slower because of the dequantization step, i.e. the model has to convert the lower-precision values back to the higher-precision format using the stored quantization zero point and scale.
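For anyone unfamiliar, that dequantization step is just an affine rescale per weight. A minimal numpy sketch (the scale/zero-point values here are made up for illustration):

```python
import numpy as np

# Hypothetical int8 weights with a per-tensor scale and zero point
# (values are illustrative, not from any particular library).
q_weights = np.array([-12, 0, 57, 127], dtype=np.int8)
scale = 0.05       # step size determined during quantization
zero_point = 3     # int8 value that maps back to 0.0 in float

# Affine dequantization: recover approximate float weights
deq_weights = scale * (q_weights.astype(np.float32) - zero_point)
print(deq_weights)  # roughly the original fp32 weights
```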

But from a few quick searches and a convo with ChatGPT, it seems like sometimes, for the sake of speed, inference is done directly in the lower-bit format (without ever bringing the values back to the higher precision), and sometimes some mixed-precision scheme is used.
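As I understand it, the "stay in low precision" case looks roughly like this: do the matmul with integer inputs and an int32 accumulator (which is what int8 tensor cores do), and only rescale the output once, instead of dequantizing every weight up front. A toy numpy sketch, assuming symmetric quantization (zero points of 0) and made-up scales:

```python
import numpy as np

# Toy int8 activations and weights (random values, purely illustrative)
x_q = np.random.randint(-128, 127, size=(1, 4), dtype=np.int8)
w_q = np.random.randint(-128, 127, size=(4, 3), dtype=np.int8)
x_scale, w_scale = 0.02, 0.05  # per-tensor scales, zero points assumed 0

# Accumulate in int32 to avoid overflow, like int8 GEMM kernels do
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)

# One rescale per output element instead of dequantizing every weight
y = acc.astype(np.float32) * (x_scale * w_scale)
print(y)
```

Whether that actually ends up faster seems to depend on whether the hardware has fast int8 paths and whether the workload is memory-bound; if the kernel just dequantizes on the fly and computes in fp16/fp32, the win is mostly from moving fewer bytes, not from the math itself.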