Hi @sanchit-gandhi, I have trained a whisper-medium model using QLoRA for ASR and would like to deploy it. I want to know what quantization/speed improvements I can make for deployment (ideally on CPU). I looked into the issue of hallucinations when using 4/8-bit inference and also see that half-precision works better. Can I use ONNX for my half-precision model? Or what about BetterTransformer? Thanks.
Hi @RajSang, you can check this Twitter thread for various Whisper benchmarks on GPU and CPU using Optimum and ONNXRuntime: https://twitter.com/IlysMoutawwakil/status/1667258837194383360
For BetterTransformer, it would be easy to benchmark it yourself using Optimum Benchmark (huggingface/optimum-benchmark on GitHub), a repository for benchmarking HF Optimum's optimizations for inference and training.
Thanks @regisss, this was very helpful! I had two more questions -
1.) Where can I learn about the different ORT quantizations, what they mean, and how to select one depending on hardware? In the Optimum docs (Quantization), the link to the conceptual guide is broken: https://huggingface.co/concept_guides/quantization
2.) Is it possible to apply both ORT optimizations and BetterTransformer to Whisper? More precisely, can I use something like ORT (O3 optimization) + ORT Quantization + BetterTransformer? If so does the order of applying these matter?
- Good catch, I’ll fix this link. Here it is: Quantization
- A BetterTransformer model is not exportable to ONNX at the moment, you can find more information in my reply here: Export a BetterTransformer to ONNX - #2 by regisss
But you can absolutely combine ORT optimization and quantization, even though I’m not sure if there is a general rule regarding which one should be applied first. I guess you’ll have to try both. Maybe @fxmarty or @IlyasMoutawwakil know more about this?
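For the quantization half of that, here is a minimal sketch of dynamic int8 quantization with Optimum's ONNX Runtime classes. A few assumptions: `model_id` points to a full Whisper checkpoint (for a QLoRA model, the adapter would need to be merged into the base weights first), the target CPU supports AVX512-VNNI (other `AutoQuantizationConfig` presets exist for other hardware), and the helper names `onnx_files` and `quantize_whisper_dynamic` are made up for illustration. The names of the exported ONNX files vary between Optimum versions, so the sketch simply quantizes every `.onnx` file it finds. It does not cover the graph optimization step.

```python
from pathlib import Path


def onnx_files(export_dir):
    """Return the names of the ONNX graphs found in export_dir, sorted."""
    return sorted(p.name for p in Path(export_dir).glob("*.onnx"))


def quantize_whisper_dynamic(model_id, export_dir, quantized_dir):
    """Export a Whisper checkpoint to ONNX and quantize each graph to int8."""
    # Heavy imports live inside the function so that the helper above
    # can be used without Optimum installed.
    from optimum.onnxruntime import ORTModelForSpeechSeq2Seq, ORTQuantizer
    from optimum.onnxruntime.configuration import AutoQuantizationConfig

    # Export the PyTorch checkpoint to ONNX and save the resulting graphs.
    model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)
    model.save_pretrained(export_dir)

    # Dynamic quantization: no calibration dataset is needed.
    # Assumption: the CPU supports AVX512-VNNI.
    qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

    # Whisper is exported as several graphs (encoder, decoder, ...),
    # and each one is quantized separately.
    for file_name in onnx_files(export_dir):
        quantizer = ORTQuantizer.from_pretrained(export_dir, file_name=file_name)
        quantizer.quantize(save_dir=quantized_dir, quantization_config=qconfig)
```

Since hallucinations were mentioned with 8-bit inference, it would be worth comparing the word error rate of the quantized model against the original before deploying.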
Hi @regisss, thanks for the links! I see that the ORT Quantization supports 8-bit int quantization. I was wondering if directly loading the model in 4-bits is a better optimization (provided performance does not drop too much). Can I combine ORT optimization with the load_in_4bit = True option?
Hi @RajSang, 4-bit quantization is not supported by ORT so this won’t work. And as you said, it depends on whether you try to optimize memory or latency.
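To put the memory side of that trade-off in numbers, here is a rough back-of-the-envelope sketch. It assumes about 769M parameters for whisper-medium and counts only the weights, ignoring activations and the extra scales and zero-points that quantized formats store, so real footprints will be somewhat larger. `weight_memory_gb` is a hypothetical helper, not a library function.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: whisper-medium has roughly 769M parameters.
WHISPER_MEDIUM_PARAMS = 769_000_000

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    size = weight_memory_gb(WHISPER_MEDIUM_PARAMS, bits)
    print(f"{name}: {size:.2f} GB")
```

A smaller footprint does not automatically mean lower latency: that depends on whether the runtime has fast kernels for the chosen precision on your hardware.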
Does it mean that the following code does not work, or that it does not decrease the inference time?
(I’m using Llama2 here, but it could be Whisper)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
model_4bit = model_4bit.to_bettertransformer()
@pierreguillou Do you mean combining BitsAndBytes with BetterTransformer?
By the way, for inference, GPTQ seems like a better option than BitsAndBytes: Overview of natively supported quantization schemes in 🤗 Transformers
Yes, but creating a GPTQ model requires a GPU with at least 40 GB, no? (I tried with a Llama2 7B on a Colab T4 GPU using the HF notebook, but it was not possible.)
Ah yes, generating the quantized model with GPTQ is more demanding than BitsAndBytes. I don’t know exactly what the minimal GPU specs would be for Whisper. @IlyasMoutawwakil, any idea?
- BitsAndBytes and GPTQ can only be used with PyTorch because they use custom dtypes and kernels which are not compatible with ONNX.
- The combination BitsAndBytes+BetterTransformer is possible and decreases latency (tested in the LLM-Perf Leaderboard with
- GPTQ only supports text models, while BitsAndBytes is supposed to work with any model as long as it contains linear layers.
- I think it’s possible to quantize llama-7b using GPTQ even on a T4, but you’ll need to force CPU offloading because llama-7b can be loaded on a T4 but requires more VRAM (~18GB) during inference. It seems autodispatching doesn’t detect that and only uses the GPU.
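One way to force that offloading could look like the sketch below, which caps GPU usage through the `max_memory` argument of `from_pretrained` so that the remaining layers are placed on CPU. This is untested on a T4: the 10 GiB / 30 GiB limits and the `"c4"` calibration dataset are assumptions to adjust, the helper names are made up for illustration, and GPTQ quantization in Transformers also needs the `optimum` and `auto-gptq` packages installed.

```python
def build_max_memory(gpu_gib, cpu_gib, gpu_index=0):
    """Build a max_memory mapping that caps usage on one GPU and on CPU."""
    return {gpu_index: f"{gpu_gib}GiB", "cpu": f"{cpu_gib}GiB"}


def quantize_gptq_with_offload(model_id, gpu_gib=10, cpu_gib=30):
    """Quantize a causal LM to 4-bit GPTQ while limiting GPU memory usage."""
    # Heavy imports live inside the function so that the helper above
    # can be used without Transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Assumption: 4-bit quantization calibrated on the "c4" dataset.
    gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

    # With device_map="auto", max_memory caps what is placed on the GPU;
    # whatever does not fit under the cap is offloaded to CPU.
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=gptq_config,
        device_map="auto",
        max_memory=build_max_memory(gpu_gib, cpu_gib),
    )
```

Calling `quantize_gptq_with_offload("meta-llama/Llama-2-7b-hf")` would then run the quantization with the GPU capped at 10 GiB. Expect it to be slow, since part of the model lives on CPU.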
Thank you @IlyasMoutawwakil. Do you have a notebook showing the code?