Improving Whisper for Inference

RajSang · July 23, 2023, 8:24pm

Hi @sanchit-gandhi, I have trained a whisper-medium model using QLoRA for ASR and would like to deploy it. I want to know what quantization/speed improvements I can make for deployment (ideally on CPU). I looked into the issue of hallucinations when using 4/8-bit inference and also see that half-precision works better. Can I use ONNX for my half-precision model? Or what about BetterTransformer? Thanks.

regisss · July 24, 2023, 7:41am

Hi @RajSang, you can check this Twitter thread for various Whisper benchmarks on GPU and CPU using Optimum and ONNXRuntime: https://twitter.com/IlysMoutawwakil/status/1667258837194383360

For BetterTransformer, it would be easy to benchmark it yourself using Optimum Benchmark (huggingface/optimum-benchmark on GitHub), a repository for benchmarking HF Optimum's optimizations for inference and training.

cc @IlyasMoutawwakil

RajSang · July 24, 2023, 8:19am

Thanks @regisss, this was very helpful! I had two more questions -

1.) Where can I learn about the different ORT quantizations, what they mean, and how to select one depending on hardware? In the Optimum docs (Quantization), the following conceptual guide has a broken link: https://huggingface.co/concept_guides/quantization

2.) Is it possible to apply both ORT optimizations and BetterTransformer to Whisper? More precisely, can I use something like ORT (O3 optimization) + ORT Quantization + BetterTransformer? If so does the order of applying these matter?

regisss · July 24, 2023, 8:35am

Good catch, I’ll fix this link. Here it is: Quantization
A BetterTransformer model is not exportable to ONNX at the moment. You can find more information in my reply here: Export a BetterTransformer to ONNX - #2 by regisss
But you can absolutely combine ORT optimization and quantization, even though I’m not sure if there is a general rule regarding which one should be applied first. I guess you’ll have to try both. Maybe @fxmarty or @IlyasMoutawwakil know more about this?
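
For readers wondering what 8-bit quantization actually does to a tensor, here is a minimal sketch in plain Python of asymmetric (affine) uint8 quantization, the general idea behind int8 schemes. It is an illustration only, not ONNX Runtime's actual implementation, and the function names are made up for this example.

```python
def quantize_uint8(values):
    """Map floats to integers in [0, 255] using a shared scale and zero point."""
    # The representable range must include 0.0 so that zero maps exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-lo / scale)
    quantized = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize_uint8(quantized, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize_uint8(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point, max_error)
```

The round-trip error per value is bounded by roughly half of `scale`. Fewer bits means fewer levels over the same range, so a larger `scale` and a larger error, which is the accuracy side of the trade-off.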

RajSang · July 26, 2023, 9:48am

Hi @regisss, thanks for the links! I see that ORT quantization supports 8-bit integer quantization. I was wondering if directly loading the model in 4-bit is a better optimization (provided performance does not drop too much). Can I combine ORT optimization with the load_in_4bit=True option?

regisss · July 26, 2023, 1:37pm

Hi @RajSang, 4-bit quantization is not supported by ORT, so this won’t work. And as you said, it depends on whether you are trying to optimize for memory or latency.
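
To make the memory side of that trade-off concrete, here is some back-of-the-envelope arithmetic for weight storage only. It assumes roughly 769M parameters for whisper-medium and ignores activations, quantization constants, and any layers kept in higher precision, so real numbers will differ.

```python
# Assumption: approximate published parameter count for whisper-medium.
WHISPER_MEDIUM_PARAMS = 769_000_000

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_footprint_gb(num_params, bytes_per_param):
    """Weights-only storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


footprints = {
    name: weight_footprint_gb(WHISPER_MEDIUM_PARAMS, nbytes)
    for name, nbytes in BYTES_PER_PARAM.items()
}

for name, gb in footprints.items():
    print(f"{name}: ~{gb:.2f} GB")
```

This only speaks to memory. Whether a given precision also reduces latency depends on the kernels and hardware, so that part needs to be benchmarked.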

pierreguillou · September 18, 2023, 8:26pm

Hi @regisss,

Does it mean that the following code does not work, or that it does not decrease the inference time?

(I’m using Llama2 here, but it could be Whisper)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token 

model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
model_4bit = model_4bit.to_bettertransformer()

regisss · September 18, 2023, 9:09pm

@pierreguillou Do you mean combining BitsAndBytes with BetterTransformer?

By the way, for inference, GPTQ seems like a better option than BitsAndBytes: Overview of natively supported quantization schemes in 🤗 Transformers

pierreguillou · September 18, 2023, 9:38pm

Yes, but creating a GPTQ model needs a GPU with at least 40 GB, no? (I tried with Llama2 7B on a Colab T4 GPU using the HF notebook, but it was not possible.)

regisss · September 19, 2023, 11:19am

Ah yes, generating the quantized model with GPTQ is more demanding than with BitsAndBytes. I don’t know exactly what the minimal GPU specs would be for Whisper. @IlyasMoutawwakil, any idea?

IlyasMoutawwakil · September 19, 2023, 12:03pm

BitsAndBytes and GPTQ can only be used with PyTorch because they use custom dtypes and kernels which are not compatible with ONNX.
The combination BitsAndBytes+BetterTransformer is possible and decreases latency (tested in the LLM-Perf Leaderboard with fp4).
GPTQ only supports text models, while BitsAndBytes is supposed to work with any model as long as it contains linear layers.
I think it’s possible to quantize llama-7b using GPTQ even on a T4, but you’ll need to force CPU offloading, because llama-7b can be loaded on a T4 but requires more VRAM (~18GB) during inference. It seems accelerate’s auto dispatching doesn’t detect that and only uses the GPU.
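
One way to force that offloading is the `max_memory` argument of `from_pretrained`, which caps how much each device may hold so the rest goes to CPU. The sketch below is a rough outline, not a tested recipe: the memory budgets are hypothetical, `build_max_memory` and `quantize_with_cpu_offload` are names made up for this example, the calibration dataset is an arbitrary choice, and it assumes `optimum` and `auto-gptq` are installed alongside `transformers`. It has not been verified on a T4.

```python
def build_max_memory(gpu_gib, cpu_gib, gpu_index=0):
    """Build a max_memory mapping: a cap for one GPU plus a CPU budget."""
    return {gpu_index: f"{gpu_gib}GiB", "cpu": f"{cpu_gib}GiB"}


def quantize_with_cpu_offload(model_id, gpu_gib=10, cpu_gib=30):
    """Quantize a causal LM with GPTQ while capping GPU usage (assumed budgets)."""
    # Imported lazily so the helper above can be used without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # "c4" is one of the built-in calibration datasets; picked here as an example.
    gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=gptq_config,
        device_map="auto",
        # Keep the GPU cap below the physical VRAM to leave headroom.
        max_memory=build_max_memory(gpu_gib, cpu_gib),
    )
    return model, tokenizer


print(build_max_memory(10, 30))
```

Calling `quantize_with_cpu_offload("meta-llama/Llama-2-7b-hf")` would start the actual quantization, which downloads the model and can take a long time.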

pierreguillou · September 20, 2023, 12:23am

Thank you @IlyasMoutawwakil. Do you have a notebook showing the code?
