Model size-quantization tradeoff for local offline inference

I am working on a project where I need to preprocess a large text corpus offline with an LLM and structured output. I am trying to run the model locally on 16 GB of VRAM (though the desktop environment takes a few GB of that on my GPU).

The solution I am currently developing uses Llama 3.1 8B with 8-bit bitsandbytes quantization and the outlines package for structured output. I am thinking about switching to AWQ or GPTQ 4-bit quants for speed, but I am not sure whether that would sacrifice too much quality. I also tried loading larger models such as Mistral Nemo and Qwen 2.5 Coder 14B with AWQ, but I struggle to allocate enough memory for them, and when they do fit, the results have been subpar.
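For context, this is roughly what my current setup looks like. It's a minimal sketch assuming the transformers `BitsAndBytesConfig` API and the pre-1.0 outlines API (`outlines.models.transformers` / `outlines.generate.json`); the model name and schema are just illustrative:

```python
from pydantic import BaseModel
from transformers import BitsAndBytesConfig

import outlines

# Illustrative output schema; the real one depends on the corpus.
class Record(BaseModel):
    title: str
    summary: str

# 8-bit bitsandbytes quantization, weights placed on the GPU via device_map.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = outlines.models.transformers(
    "meta-llama/Llama-3.1-8B-Instruct",
    model_kwargs={
        "quantization_config": bnb_config,
        "device_map": "auto",
    },
)

# Constrained JSON generation that always conforms to the Record schema.
generator = outlines.generate.json(model, Record)
result = generator("Extract the title and a one-sentence summary:\n" + "some document text")
print(result.title, result.summary)
```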

Should I stick with BnB, or keep experimenting with AWQ and/or GPTQ?


Well, if you’re doing 4-bit quantization, I think GPTQ and GGUF are slightly more accurate than BNB.

When processing long texts, the memory savings from quantization make a significant difference, but on the other hand there are plenty of use cases where BnB's NF4 is accurate enough, so a lot of this you won't really know until you try it… I'll list some comparisons below.
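If you want to try NF4 before committing to AWQ/GPTQ, the change is small since it's just a different quantization config. A minimal sketch, assuming the transformers `BitsAndBytesConfig` API (the model id is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 with double quantization: roughly half the memory of 8-bit,
# often with only a modest accuracy drop.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # or a larger model that now fits in 16 GB
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=nf4_config,
    device_map="auto",
)
```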