Should 8-bit quantization make inference faster on GPU?

Inference on a sample file takes much longer (about 5x) when whisper-large-v3 is loaded in 8-bit mode on an NVIDIA T4 GPU.

Shouldn’t quantization improve inference speed on GPU?

Also, nvidia-smi shows GPU utilization at only about 33%.

similar question:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from transformers.pipelines.audio_classification import ffmpeg_read

model_id = "openai/whisper-large-v3"
model_8bit = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    load_in_8bit=True,   # bitsandbytes 8-bit weights
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
feature_extractor = processor.feature_extractor
tokenizer = processor.tokenizer

sample = "sample.mp3"  # 27 s long

with torch.inference_mode():
    with open(sample, "rb") as f:
        inputs = f.read()
    inputs = ffmpeg_read(inputs, feature_extractor.sampling_rate)

    input_features = feature_extractor(
        inputs,
        sampling_rate=feature_extractor.sampling_rate,
        return_tensors="pt",
    )["input_features"]

    # input_features is already a tensor; move/cast it instead of torch.tensor(...)
    input_features = input_features.to("cuda", dtype=torch.float16)

    predicted_ids = model_8bit.generate(input_features=input_features, return_timestamps=False)
    out = tokenizer.decode(predicted_ids.squeeze())

I’m a bit confused too, but for the opposite reason. I thought inference with quantization was supposed to be slower because of the dequantization step: the model has to convert the lower-precision weights back to the higher-precision format using the stored quantization zero point and scale.
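To illustrate that round trip, here is a minimal sketch of generic asymmetric (zero point + scale) quantization in NumPy. This is not what bitsandbytes does internally, just the textbook scheme the terms refer to:

```python
import numpy as np

def quantize(x, n_bits=8):
    # Affine quantization: map [x.min(), x.max()] onto the uint range.
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # The extra work done at inference time: back to float via scale/zero point.
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize(w)
w_hat = dequantize(q, scale, zp)
# Reconstruction error is at most half a quantization step (scale / 2).
err = np.abs(w - w_hat).max()
```

The storage win is that `q` is one byte per weight instead of four; the dequantize call is the per-inference overhead the question is about.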

But from a few quick searches and a conversation with ChatGPT, it seems that sometimes, for the sake of speed, inference is done directly in the lower-bit format (without ever bringing it back to the higher precision), and sometimes a mixed-precision scheme is used.
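For what it's worth, the 8-bit path behind `load_in_8bit` (bitsandbytes' LLM.int8()) is a mixed-precision scheme of this kind: "outlier" feature dimensions are kept in higher precision while the rest of the matrix multiply runs in int8. A heavily simplified NumPy sketch of that idea, assuming symmetric per-tensor quantization (real kernels quantize per row/column and use custom GPU kernels):

```python
import numpy as np

def int8_matmul_with_outliers(x, w, outlier_threshold=6.0):
    # Columns of x with large-magnitude activations ("outliers") stay in float;
    # everything else is quantized to int8, multiplied, then dequantized.
    outlier_cols = np.abs(x).max(axis=0) > outlier_threshold
    regular = ~outlier_cols

    # Symmetric int8 quantization of the regular part.
    sx = max(np.abs(x[:, regular]).max(), 1e-8) / 127.0
    sw = max(np.abs(w[regular, :]).max(), 1e-8) / 127.0
    xq = np.round(x[:, regular] / sx).astype(np.int8)
    wq = np.round(w[regular, :] / sw).astype(np.int8)

    # int8 x int8 with int32 accumulation, then dequantize the result...
    y_int8 = (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)
    # ...and add the outlier part, computed in full precision.
    y_fp = x[:, outlier_cols] @ w[outlier_cols, :]
    return y_int8 + y_fp

x = np.random.randn(8, 16).astype(np.float32)
w = np.random.randn(16, 4).astype(np.float32)
x[:, 3] *= 20.0  # make one feature dimension an outlier
err = np.abs(int8_matmul_with_outliers(x, w) - x @ w).max()
```

The decomposition itself (splitting columns, two matmuls, dequantizing) is extra work per layer, which is one plausible reason the 8-bit model can end up slower than plain fp16 on some GPUs even though it saves memory.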