- BitsAndBytes and GPTQ can only be used with PyTorch because they rely on custom dtypes and kernels that are not compatible with ONNX.
- The combination BitsAndBytes+BetterTransformer is possible and decreases latency (tested in the LLM-Perf Leaderboard with `fp4`); a minimal loading sketch is shown after this list.
- GPTQ only supports text models, while BitsAndBytes is supposed to work with any model as long as it contains linear layers.
- I think it’s possible to quantize llama-7b using GPTQ even on a T4, but you’ll need to force CPU offloading: llama-7b can be loaded on a T4, but it requires more VRAM (~18GB) during inference. It seems `accelerate`’s `auto` dispatching doesn’t detect that and only uses the GPU; see the second sketch below.
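For reference, here is a minimal sketch of the BitsAndBytes `fp4` + BetterTransformer combination, assuming a recent `transformers` with `optimum` and `bitsandbytes` installed; the checkpoint name is just an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"  # example checkpoint, use your own

# 4-bit BitsAndBytes quantization with the fp4 dtype
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Swap in BetterTransformer kernels on top of the quantized model
# (requires optimum to be installed)
model = model.to_bettertransformer()
```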
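And a sketch of forcing CPU offloading during GPTQ quantization by capping GPU memory yourself instead of trusting `device_map="auto"` alone; the 12GiB/64GiB caps are illustrative, not tested values (requires `auto-gptq` and `optimum`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "huggyllama/llama-7b"  # example 7B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ quantization, calibrated on the c4 dataset
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Cap GPU memory so accelerate's dispatcher offloads the remaining
# layers to CPU rather than trying to fit everything on the 16GB T4
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
    max_memory={0: "12GiB", "cpu": "64GiB"},
)
```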