Quantization reduces the size of the model, but it also reduces accuracy.
Increasing the context size increases the memory footprint (the weights stay the same; the KV cache grows with the window) and lets the model work with more text at once.
How should we decide between the two? For example, when running Llama 3 with 8-bit quantization:
If we use Ollama to create the model from the GGUF file, it uses a default context size of 2K and produces a 10GB model. If we change the context size to 32K, the model grows to 15GB, and now it no longer fits on the GPU …
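For concreteness, this is the kind of Modelfile that controls the context window when building from a GGUF. `num_ctx` is Ollama's context-size parameter; the GGUF filename and model name below are placeholders, not anything from the original setup:

```
# Modelfile: build a model from a local GGUF and raise the context window.
# ./llama3-q8_0.gguf is a placeholder path for your quantized weights.
FROM ./llama3-q8_0.gguf

# num_ctx sets the context window Ollama allocates (the default is 2048).
PARAMETER num_ctx 32768
```

Then build it with `ollama create llama3-32k -f Modelfile`. The size jump you see comes from the KV cache that the larger window requires, not from the weights themselves.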
How much effect does quantization have on accuracy, and how much effect does context size have?
If context size matters much more, we could use a 4-bit quantized model with a 32K context window.
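To see why that trade might work, here is a rough memory sketch. It assumes the 8B Llama 3 (32 layers, grouped-query attention with 8 KV heads, head dim 128, per the published architecture), an fp16 KV cache, and simple bits-per-weight math; real Ollama footprints run a few GB higher because of embeddings and compute buffers, which is consistent with the 10GB/15GB figures above:

```python
# Back-of-envelope VRAM estimate: quantized weights + KV cache.
# Architecture numbers are Llama 3 8B's; fp16 KV cache is an assumption.

GIB = 1024 ** 3

def kv_cache_bytes(ctx_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

def weights_bytes(n_params=8e9, bits_per_weight=8):
    return n_params * bits_per_weight / 8

for bits in (8, 4):
    for ctx in (2048, 32768):
        total = weights_bytes(bits_per_weight=bits) + kv_cache_bytes(ctx)
        print(f"Q{bits} weights + {ctx:>5}-token KV cache: ~{total / GIB:.1f} GiB")
```

Under these assumptions, Q4 at 32K lands near the same footprint as Q8 at 2K, which is exactly the swap being proposed: spend the saved weight memory on a bigger KV cache instead.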