Should 8-bit quantization make inference faster on GPU?

Inference on a sample file takes much longer (about 5x) when whisper-large-v3 is loaded in 8-bit mode on an NVIDIA T4 GPU.

Shouldn’t quantization improve inference speed on GPU?
https://pytorch.org/docs/stable/quantization.html

Also, GPU utilization is at 33% in nvidia-smi.


import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from transformers.pipelines.audio_utils import ffmpeg_read

# Feature extractor + tokenizer for whisper-large-v3
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
feature_extractor = processor.feature_extractor
tokenizer = processor.tokenizer

model_8bit = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    device_map="auto",
    load_in_8bit=True,
)

sample = "sample.mp3"  # 27 s long

with torch.inference_mode():
    with open(sample, "rb") as f:
        audio_bytes = f.read()

    # Decode the audio bytes to a waveform at the model's sampling rate
    audio = ffmpeg_read(audio_bytes, feature_extractor.sampling_rate)

    # Log-mel input features, moved to GPU in fp16 (the compute dtype of the 8-bit layers)
    input_features = feature_extractor(
        audio,
        sampling_rate=feature_extractor.sampling_rate,
        return_tensors="pt",
    )["input_features"]
    input_features = input_features.to("cuda", dtype=torch.float16)

    generated_ids = model_8bit.generate(input_features=input_features, return_timestamps=False)

    out = tokenizer.decode(generated_ids.squeeze())
    print(out)
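
For comparison, the same checkpoint can be loaded in plain fp16 (no bitsandbytes) and timed on the same features. A minimal sketch, reusing input_features from the snippet above:

import time

import torch
from transformers import AutoModelForSpeechSeq2Seq

model_fp16 = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
).to("cuda")

with torch.inference_mode():
    torch.cuda.synchronize()
    start = time.perf_counter()
    model_fp16.generate(input_features=input_features, return_timestamps=False)
    torch.cuda.synchronize()
    print(f"fp16 generate took {time.perf_counter() - start:.2f}s")

Timing model_8bit the same way shows how much of the gap comes from the bitsandbytes layers rather than from the rest of the pipeline.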

I’m a bit confused too, but for the opposite reason. I thought inference with quantization was supposed to be slower because of the dequantization step: the model has to convert the lower-precision values back to the higher-precision format using the stored quantization zero point and scale.
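
Concretely, with an affine scheme the stored int8 value q maps back to roughly scale * (q - zero_point). A toy round trip of that idea (illustrative only, not the actual bitsandbytes kernels):

import torch

w = torch.randn(4, 4)

# Affine quantization to int8: pick a scale and zero point that cover w's range
scale = (w.max() - w.min()) / 255
zero_point = torch.round(-w.min() / scale) - 128
q = torch.clamp(torch.round(w / scale + zero_point), -128, 127).to(torch.int8)

# Dequantization: the extra work paid at inference time if weights are restored to full precision
w_deq = scale * (q.to(torch.float32) - zero_point)

print((w - w_deq).abs().max())  # small quantization error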

But from a few quick searches and a conversation with ChatGPT, it seems that sometimes, for the sake of speed, inference is done directly in the lower-bit format (without ever converting back to higher precision), and sometimes a mixed-precision scheme is used.
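
As an illustration of the "stay in low precision" route (a toy sketch, not any particular library's kernel): with symmetric int8 quantization the matmul can be done on the integer values, accumulated in a wider integer type, and rescaled once at the end, so the weights never go back to fp16/fp32:

import torch

a = torch.randn(8, 16)   # stand-in for activations
b = torch.randn(16, 4)   # stand-in for a weight matrix

# Symmetric per-tensor quantization to int8 (zero point is 0)
scale_a = a.abs().max() / 127
scale_b = b.abs().max() / 127
a_q = torch.clamp(torch.round(a / scale_a), -128, 127).to(torch.int8)
b_q = torch.clamp(torch.round(b / scale_b), -128, 127).to(torch.int8)

# Integer matmul (real int8 kernels accumulate in int32; int64 keeps this toy CPU example simple),
# followed by a single rescale of the output
c_int = a_q.to(torch.int64) @ b_q.to(torch.int64)
c_approx = c_int.to(torch.float32) * (scale_a * scale_b)

print((a @ b - c_approx).abs().max())  # close to the fp32 result

As I understand it, bitsandbytes' LLM.int8() takes a mixed route: most of the matmul runs in int8 like this, while outlier feature dimensions are split off and run in fp16, which is part of why it mainly saves memory rather than time on some GPUs.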