Improving Whisper for Inference

  • BitsAndBytes and GPTQ can only be used with PyTorch because they rely on custom dtypes and kernels that are not compatible with ONNX.
  • Combining BitsAndBytes with BetterTransformer is possible and decreases latency (tested on the LLM-Perf Leaderboard with fp4); see the first sketch after this list.
  • GPTQ only supports text models, while BitsAndBytes should work with any model that contains linear layers.
  • I think it’s possible to quantize llama-7b with GPTQ even on a T4, but you’ll need to force CPU offloading: llama-7b can be loaded on a T4, yet it needs more VRAM (~18 GB) than the T4 has during inference. It seems accelerate’s auto dispatching doesn’t detect that and uses only the GPU; see the second sketch below.
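Here’s a minimal sketch of the BitsAndBytes + BetterTransformer combination mentioned above, applied to Whisper. The checkpoint name and config values are illustrative, and it assumes `transformers`, `bitsandbytes`, `accelerate`, and `optimum` are installed with a CUDA GPU available:

```python
import torch
from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration

# fp4 is the 4-bit BitsAndBytes variant tested on the LLM-Perf Leaderboard
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",  # illustrative checkpoint
    quantization_config=quant_config,
    device_map="auto",
)

# BetterTransformer swaps in fused attention kernels on top of the
# quantized model, which is where the extra latency win comes from.
model = model.to_bettertransformer()
```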
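And a sketch of GPTQ-quantizing llama-7b on a T4 with forced CPU offloading. Capping the GPU’s share via `max_memory` makes accelerate offload the remaining layers to CPU instead of dispatching everything to the GPU; the checkpoint id and the memory limits are assumptions, not tested values:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "huggyllama/llama-7b"  # illustrative llama-7b checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs a calibration dataset; "c4" is one of the built-in options
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Keep GPU usage well under the T4's 16 GB so whatever doesn't fit
# during quantization is offloaded to CPU rather than OOM-ing.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "64GiB"},
)
```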