Quantisation of LLM models

tesemnikov-av · November 18, 2024, 7:32am

Am I correct in understanding that Q4_K_S quantization means 4-bit quantization with k-quantization (K) and a small (S) block size? What does a small block size mean? Does Q3 mean 3-bit quantization? Thanks!

John6666 · November 18, 2024, 8:51am

I can’t say I understand it correctly either, so I searched for it. If in doubt, use Q4_K_M.
https://www.reddit.com/r/LocalLLaMA/comments/1d1sc50/gguf_weight_encoding_suffixes_is_there_a_guide/
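
To make "N-bit quantization with blocks" a bit more concrete, here is a minimal sketch of generic block-wise quantization in plain Python. It is an illustration only and not the actual llama.cpp K-quant layout: the block size of 32, the 16-bit scale, and all function names are assumptions chosen for the example.

```python
def quantize_block(block, bits=4):
    """Quantize one block of floats to signed integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit
    amax = max(abs(w) for w in block)
    # One scale per block; fall back to 1.0 for an all-zero block.
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(qmin, min(qmax, round(w / scale))) for w in block]
    return scale, codes


def quantize(weights, block_size=32, bits=4):
    """Split weights into blocks (assumed size) and quantize each one."""
    return [
        quantize_block(weights[i:i + block_size], bits)
        for i in range(0, len(weights), block_size)
    ]


def dequantize(blocks):
    """Reconstruct approximate floats from (scale, codes) pairs."""
    out = []
    for scale, codes in blocks:
        out.extend(scale * c for c in codes)
    return out


def bits_per_weight(block_size=32, bits=4, scale_bits=16):
    """Effective storage cost: the code bits plus the scale shared by a block."""
    return bits + scale_bits / block_size


if __name__ == "__main__":
    weights = [i / 10 - 3.2 for i in range(64)]
    blocks = quantize(weights)
    restored = dequantize(blocks)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"blocks={len(blocks)} worst_error={worst:.4f}")
    print(f"bits per weight: {bits_per_weight()}")
```

In this toy scheme a smaller block means each scale covers fewer weights, so the rounding error drops but the per-weight overhead of storing scales rises.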

https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

README.md

# Which GGUF is right for me? (Opinionated)

Good question! I am collecting human data on how quantization affects outputs. See here for more information: https://github.com/ggerganov/llama.cpp/discussions/5962

In the meantime, use the largest that fully fits in your GPU. If you can comfortably fit Q4_K_S, try using a model with more parameters.
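
The "largest that fits" rule can be sanity-checked with rough arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below assumes that relationship; the bits-per-weight value, the overhead allowance for context and buffers, and the function names are illustrative assumptions, not figures from this gist.

```python
def estimate_size_gb(n_params, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_vram(n_params, bits_per_weight, vram_gb, overhead_gb=1.0):
    """True if the weights plus an assumed overhead fit in the given VRAM."""
    return estimate_size_gb(n_params, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Assumed example: a 7B-parameter model at about 4.5 bits per weight.
    size = estimate_size_gb(7e9, 4.5)
    print(f"estimated size: {size:.2f} GB")
    print("fits in 8 GB:", fits_in_vram(7e9, 4.5, vram_gb=8.0))
```

Real files differ somewhat because different tensors may be stored at different precisions, so treat the result as a first estimate and check the actual file size.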

# llama.cpp feature matrix

See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix
