Model size-quantization tradeoff for local offline inference

I am working on a project where I need to preprocess a large text corpus offline with an LLM and structured output. I am trying to run the model locally on 16 GB of VRAM (though the desktop environment takes a few GB of that on my GPU).

The solution I am currently developing uses Llama 3.1 8B with 8-bit bitsandbytes quantization and the outlines package for structured output. I am thinking about switching to AWQ or GPTQ 4-bit quants for speed, but I am not sure whether that would sacrifice too much quality. I also tried loading larger models such as Mistral Nemo and Qwen 2.5 Coder 14B with AWQ, but I struggle to allocate enough memory for them, and when they do fit, the results have been subpar.
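For context, this is roughly what my current setup looks like. It's a minimal sketch assuming the transformers `BitsAndBytesConfig` API and the pre-1.0 outlines API (`outlines.models.transformers` / `outlines.generate.json`); the model name and schema are just illustrative:

```python
from pydantic import BaseModel
from transformers import BitsAndBytesConfig

import outlines

# Illustrative output schema; the real one depends on the corpus.
class Record(BaseModel):
    title: str
    summary: str

# 8-bit bitsandbytes quantization, weights placed on the GPU via device_map.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = outlines.models.transformers(
    "meta-llama/Llama-3.1-8B-Instruct",
    model_kwargs={
        "quantization_config": bnb_config,
        "device_map": "auto",
    },
)

# Constrained JSON generation that always conforms to the Record schema.
generator = outlines.generate.json(model, Record)
result = generator("Extract the title and a one-sentence summary:\n" + "some document text")
print(result.title, result.summary)
```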

Should I stick with BnB, or keep experimenting with AWQ and/or GPTQ?


Well, if you’re doing 4-bit quantization, I think GPTQ and GGUF are slightly more accurate than BNB.

When processing long texts, the memory savings from quantization make a significant difference, but on the other hand there are plenty of use cases where BnB's NF4 is accurate enough, so a lot of this you won't really know until you try it… I'll list some comparisons below.
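If you want to try NF4 before committing to AWQ/GPTQ, the change is small since it's just a different quantization config. A minimal sketch, assuming the transformers `BitsAndBytesConfig` API (the model id is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 with double quantization: roughly half the memory of 8-bit,
# often with only a modest accuracy drop.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # or a larger model that now fits in 16 GB
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=nf4_config,
    device_map="auto",
)
```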