Should 8-bit quantization make inference faster on GPU?

I’m a bit confused too, but for the opposite reason. I thought inference with quantization was supposed to be slower because of the dequantization step, i.e. the model has to convert the lower-precision values back to the higher-precision format using the stored quantization zero point and scale.
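For anyone unfamiliar, that dequantization step is just an affine rescale per weight. A minimal numpy sketch (the scale/zero-point values here are made up for illustration):

```python
import numpy as np

# Hypothetical int8 weights with a per-tensor scale and zero point
# (values are illustrative, not from any particular library).
q_weights = np.array([-12, 0, 57, 127], dtype=np.int8)
scale = 0.05       # step size determined during quantization
zero_point = 3     # int8 value that maps back to 0.0 in float

# Affine dequantization: recover approximate float weights
deq_weights = scale * (q_weights.astype(np.float32) - zero_point)
print(deq_weights)  # roughly the original fp32 weights
```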

But from a few quick searches and a convo with ChatGPT, it seems like sometimes, for the sake of speed, inference is done directly in the lower-bit format (without ever bringing the values back to the higher precision), and sometimes some mixed-precision scheme is used.
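As I understand it, the "stay in low precision" case looks roughly like this: do the matmul with integer inputs and an int32 accumulator (which is what int8 tensor cores do), and only rescale the output once, instead of dequantizing every weight up front. A toy numpy sketch, assuming symmetric quantization (zero points of 0) and made-up scales:

```python
import numpy as np

# Toy int8 activations and weights (random values, purely illustrative)
x_q = np.random.randint(-128, 127, size=(1, 4), dtype=np.int8)
w_q = np.random.randint(-128, 127, size=(4, 3), dtype=np.int8)
x_scale, w_scale = 0.02, 0.05  # per-tensor scales, zero points assumed 0

# Accumulate in int32 to avoid overflow, like int8 GEMM kernels do
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)

# One rescale per output element instead of dequantizing every weight
y = acc.astype(np.float32) * (x_scale * w_scale)
print(y)
```

Whether that actually ends up faster seems to depend on whether the hardware has fast int8 paths and whether the workload is memory-bound; if the kernel just dequantizes on the fly and computes in fp16/fp32, the win is mostly from moving fewer bytes, not from the math itself.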