Should I just get more RAM?

I have a PC with 32GB of RAM and a GPU with 16GB of VRAM. It runs an 11B model smoothly and generates a response of a few hundred words in 3-4 seconds (a very rough estimate). But when I use a 90B model, it is really slow: it spits out a word every 10-15 seconds, and I can see spikes in GPU usage that suggest a lot of data is being transferred back and forth. Instead of getting a better GPU with more VRAM, would increasing the system RAM help? How much improvement could I get?

1 Like

Below is a calculator that estimates the amount of VRAM you will need to run your model. The size it displays by default assumes 4-bit quantization.
If you don’t quantize (usually 16-bit), you will need around 48GB for an 11B model and around 200GB for a 90B one. If you quantize to 4-bit, you will need about a quarter of that.
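As a rough sanity check, here is the kind of back-of-the-envelope math involved (my own rule of thumb, not the calculator’s exact formula): the weights alone take about params × bits / 8 bytes, and inference needs extra headroom on top for the KV cache and runtime buffers, which is why calculators tend to report more than the bare weight size.

```python
# Minimal sketch of a rough memory estimate; the 20% overhead factor is a
# guess, not a measured value.
def rough_memory_gb(params_billions: float, bits: int, overhead: float = 1.2) -> float:
    """Very rough inference-memory estimate in GB (1 GB = 1e9 bytes here)."""
    weights_gb = params_billions * bits / 8   # bytes per parameter = bits / 8
    return weights_gb * overhead              # headroom for KV cache, buffers, etc.

if __name__ == "__main__":
    for size_b in (11, 90):
        for bits in (16, 4):
            print(f"{size_b:>3}B @ {bits:>2}-bit: ~{rough_memory_gb(size_b, bits):.0f} GB")
```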

I think you are probably already running it quantized, whether you realize it or not. So with the large model there is not enough memory, and the PC falls back on the SSD or HDD as virtual memory to compensate, which is what slows it down. If you have a little over 60GB of VRAM and RAM combined, I think it will run more smoothly than it does now. If you are going to add RAM, I recommend adding at least another 32GB on top of what you have.
However, keep in mind that the speedup you can get from adding RAM is limited, because what an LLM really needs is VRAM. Still, what you have now is definitely not enough for the 90B model, so if you are going to use it, you should at least add RAM. VRAM is just too expensive. :sweat_smile:
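To make that concrete for the setup in the first post (16GB VRAM + 32GB RAM, about 48GB in total), here is the same rough rule applied to both model sizes; the 4-bit assumption and the 20% headroom are guesses on my part.

```python
# Same rough rule as above (weights = params * bits / 8, plus ~20% headroom),
# applied to a machine with 16GB VRAM and 32GB RAM.
def rough_memory_gb(params_billions: float, bits: int, overhead: float = 1.2) -> float:
    return params_billions * bits / 8 * overhead

vram_gb, ram_gb = 16, 32                       # the setup described in the question
for size_b in (11, 90):
    need = rough_memory_gb(size_b, bits=4)     # assuming 4-bit quantization
    print(f"{size_b}B @ 4-bit needs ~{need:.0f} GB -> "
          f"fits in VRAM: {need <= vram_gb}, fits in VRAM+RAM: {need <= vram_gb + ram_gb}")
```

By this estimate the 11B model fits entirely in VRAM, while the 90B model does not even fit in VRAM and RAM combined, which would explain the swapping to disk and the 10-15 seconds per word.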

Thanks @John6666. This makes sense. A follow-up question: how do I find out which quantization a model is using? For example, I got the model with “ollama pull”. Judging by the RAM and VRAM usage, I can guess it is probably 4-bit. What other ways are there to figure it out? Is it stored in a model file somewhere?

1 Like

Where the detailed information on quantization is stored depends on the quantization method, but in the case of Ollama it was easy: it is listed on the official model page.
It seems to be 4-bit GGUF quantization, one of the quantization formats widely used on HF. Looking at the list of Ollama model sizes, most of the models seem to use the same one. I also often use this Q4_K_M format. Even reduced to this size, the results do not change much from before quantization. It is excellent. Going smaller than that is more of a gamble: some models hold up fine, others do not. Well, if it is Q4_K_M, it is safe. :grinning:
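As for the “is it stored in a model file somewhere?” part: yes, a GGUF file records its quantization in its metadata header, typically in the `general.file_type` field (if I remember correctly, `ollama show <model>` also prints the quantization level directly). Below is a minimal Python sketch that reads that metadata by hand; the Ollama blob location and the 15 = Q4_K_M mapping are assumptions on my part, based on llama.cpp’s file-type enum.

```python
# Minimal sketch: read the metadata header of a GGUF file without third-party
# packages. The GGUF header is: magic "GGUF", uint32 version, uint64 tensor
# count, uint64 metadata key/value count, followed by the key/value pairs.
import struct
import sys

def read_string(f):
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")

def read_value(f, vtype):
    simple = {
        0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
        6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d",
    }
    if vtype in simple:
        fmt = simple[vtype]
        (value,) = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
        return value
    if vtype == 8:                                  # string
        return read_string(f)
    if vtype == 9:                                  # array: element type, count, elements
        (etype,) = struct.unpack("<I", f.read(4))
        (count,) = struct.unpack("<Q", f.read(8))
        return [read_value(f, etype) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")

def read_metadata(path):
    with open(path, "rb") as f:
        assert f.read(4) == b"GGUF", "not a GGUF file"
        (_version,) = struct.unpack("<I", f.read(4))
        _tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
        meta = {}
        for _ in range(kv_count):
            key = read_string(f)
            (vtype,) = struct.unpack("<I", f.read(4))
            meta[key] = read_value(f, vtype)
        return meta

if __name__ == "__main__":
    # e.g. one of the large blob files under ~/.ollama/models/blobs/ (path is a guess)
    meta = read_metadata(sys.argv[1])
    print("name:     ", meta.get("general.name"))
    print("file_type:", meta.get("general.file_type"))   # 15 usually means Q4_K_M
```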

There are several users on HF who volunteer to create and share GGUF quantizations in bulk, so it’s also fun to pick those up and use them.

1 Like

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.