When I use Llama 3.2 (the 3B version) and compare it to ChatGPT, it just doesn't measure up. Not only does it make a lot of grammatical errors, it also doesn't follow simple instructions such as "summarize this".
Llama 3.2 (3B) is in love with self-care, so much so that it recommends self-care when asked how to draw a circle. ChatGPT does not.
ChatGPT is hilarious when using sarcasm. I love asking it to "comment on this news article in the most sarcastic way".
Llama 3.2 (3B) … well, at least it likes self-care.
Llama 3.2 (3B) stands for local and private; ChatGPT stands for "this will be used against you".
But Llama 3.2 (3B) seems incredibly bad compared to ChatGPT.
I would love to have an AI comment on my most private thoughts, but Llama 3.2 (3B) would rather promote self-care and talking to others, or even talking to a lawyer about your legal options if a friend stops talking to you (it actually wrote that).
My computer has 12 GB of VRAM.
What could I do to get an AI with good output running on those 12 GB, or partly on the 12 GB of VRAM and the rest in 64 GB of RAM?
So you expect a 3B model to perform like ChatGPT, which is based on a hugely larger model (in the region of 1700B parameters)?
Thank you for bumping the thread.
So the question is: what could I do to get an AI with good output running on those 12 GB, or partly on the 12 GB of VRAM and the rest in 64 GB of RAM?
With 12 GB of VRAM and 64 GB of RAM you can easily run more capable models, especially if you use quantized models, which take much less space than the original ones because they encode parameters with 4 to 8 bits per parameter rather than 16. To make your life easier, I suggest you install Ollama (www.ollama.com) and use it to download various models to compare. You can get them from https://ollama.com/library : I recommend the standard quantization, which these days is Q4_K_M and takes around 5 bits per weight.
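As a quick back-of-the-envelope check (my own rough estimate): at around 5 bits per weight, a 14B model takes roughly 14 × 10^9 × 5 / 8 ≈ 8.75 GB, which leaves some of your 12 GB of VRAM free for the context (KV cache).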
You may start with 7B to 22B models; the smaller ones should fit entirely in your 12 GB of VRAM and therefore run fast. For example, after Ollama is installed, you may load and run Qwen2.5 14B with the command:
ollama run qwen2.5:14b
(this command line is copied to your clipboard when you open https://ollama.com/library/qwen2.5:14b in your browser and click the button to the right of the size selector).
Once loading is complete you'll see a >>> prompt and you can start a dialogue.
With all that VRAM and RAM that you’ve got, you might be able to run the excellent Llama3.3-70B:
https://ollama.com/library/llama3.3:70b
Load and run with: ollama run llama3.3
(after the initial download, subsequent executions of the "ollama run <model_name>" command will just run the model). If you want to test its capabilities before downloading it, you can try it hosted at www.huggingface.co/chat .
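A rough size estimate (again my own math, assuming the roughly 5 bits per weight of Q4_K_M): 70 × 10^9 × 5 / 8 ≈ 44 GB, so only part of the model fits in your 12 GB of VRAM and the rest is offloaded to your 64 GB of system RAM, which is why it runs much more slowly than the smaller models.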
For other Ollama commands, enter ollama help.
Have fun!
That was elaborate!
I am already using Ollama.
Great, then let me know how the other models work for you. On my laptop with a 4070 GPU, 8 GB of VRAM and 32 GB of RAM, I managed to run the Q2 quantization of Llama3.3 70B: it's much slower than on HuggingChat (1 or 2 tokens per second), but the quality is still good despite the extreme quantization.
For fun, you may want to try chain-of-thought models such as Marco-o1 or QwQ, both available in the Ollama library. They give you a fascinating insight into those models' reasoning.
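If I remember the library tags correctly (double-check them on https://ollama.com/library before pulling), running them should be as simple as:
ollama run marco-o1
ollama run qwq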