Improving Whisper for Inference

  • BitsAndBytes and GPTQ can only be used with PyTorch because they rely on custom dtypes and kernels that are not compatible with ONNX.
  • Combining BitsAndBytes with BetterTransformer is possible and decreases latency (tested on the LLM-Perf Leaderboard with fp4); see the first sketch after this list.
  • GPTQ only supports text models, while BitsAndBytes should work with any model that contains linear layers.
  • I think it’s possible to quantize llama-7b with GPTQ even on a T4, but you’ll need to force CPU offloading: llama-7b can be loaded on a T4, yet it needs more VRAM (~18 GB) than the T4 has during inference. It seems accelerate’s auto dispatching doesn’t detect that and uses only the GPU; see the second sketch below.
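Here’s a minimal sketch of the BitsAndBytes + BetterTransformer combination mentioned above, applied to Whisper. The checkpoint name and config values are illustrative, and it assumes `transformers`, `bitsandbytes`, `accelerate`, and `optimum` are installed with a CUDA GPU available:

```python
import torch
from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration

# fp4 is the 4-bit BitsAndBytes variant tested on the LLM-Perf Leaderboard
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",  # illustrative checkpoint
    quantization_config=quant_config,
    device_map="auto",
)

# BetterTransformer swaps in fused attention kernels on top of the
# quantized model, which is where the extra latency win comes from.
model = model.to_bettertransformer()
```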
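And a sketch of GPTQ-quantizing llama-7b on a T4 with forced CPU offloading. Capping the GPU’s share via `max_memory` makes accelerate offload the remaining layers to CPU instead of dispatching everything to the GPU; the checkpoint id and the memory limits are assumptions, not tested values:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "huggyllama/llama-7b"  # illustrative llama-7b checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs a calibration dataset; "c4" is one of the built-in options
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Keep GPU usage well under the T4's 16 GB so whatever doesn't fit
# during quantization is offloaded to CPU rather than OOM-ing.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "64GiB"},
)
```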