I’m a complete newbie to this AI topic and I need some help.
My hardware is a Ryzen 5 7600, 16 GB RAM, and an NVIDIA RTX 3050 8 GB graphics card, running Ubuntu 24.04.2 LTS with Linux 6.11.0-1016-lowlatency x86_64, NVIDIA’s open driver 570.158.01, and CUDA 12.4.127 in the virtual environment.
It’s not a really powerful computer, but it shouldn’t be so weak that things run this slowly.
DeepSeek inference is extremely slow through the Transformers interface. I recently installed BitsAndBytes.
I’d like to know the recommended parameters to start a CLI chatbot interface, or an IPython one. That’s the reason for this post.
Please keep in mind that I’m just a newbie; I haven’t worked with any AI model before. My model is Open-R1 from Hugging Face.
Thank you very much.
With those GPU specs, I think you’d be better off using Ollama or vLLM to get good performance with less effort. The GGUF format used with Ollama isn’t particularly fast in itself, but it’s excellent for offloading when VRAM is limited.
Generally speaking, Transformers offers advanced customization options, but if you don’t need that level of customization, the other backends are easier to use.
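If you do stay with Transformers for now, 4-bit BitsAndBytes quantization is the usual way to squeeze a ~7B model into 8 GB of VRAM. Here is a minimal sketch; the model id is just a placeholder (swap in the exact Open-R1 checkpoint you downloaded), and the generation settings are reasonable defaults rather than official recommendations:

```python
# Minimal sketch: 4-bit (NF4) loading with Transformers + BitsAndBytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "open-r1/your-checkpoint-here"  # placeholder: use your actual Open-R1 model id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4 bits
    bnb_4bit_quant_type="nf4",             # NF4 usually gives the best quality
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 on the RTX 3050
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spills layers to CPU RAM if 8 GB VRAM is not enough
)

messages = [{"role": "user", "content": "Hello, who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```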
How can I use these things?
I have just installed Ollama.
Thank you very much. Excuse me, I’m a newbie.
Ollama is the quickest option.
Once you download and install Ollama, you should be able to launch it directly from the command line.
Ollama runs as a local server and can also be used as an API server compatible with the OpenAI API, so if you want to create your own chatbot GUI, you can access it via that API. It also handles load balancing.
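For example, here is a minimal sketch of a tiny CLI chatbot that talks to the local Ollama server through its OpenAI-compatible endpoint. It assumes Ollama is already running on the default port 11434 and that you have pulled a model; the tag `deepseek-r1:8b` is only an example, so use whatever `ollama list` shows on your machine:

```python
# Minimal sketch: chat with a local Ollama server via its OpenAI-compatible API.
import requests

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"
MODEL = "deepseek-r1:8b"  # assumption: replace with your local model tag

def chat(prompt: str) -> str:
    """Send one user message and return the assistant's reply."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Tiny CLI loop: type a message, get a reply, Ctrl-C to quit.
    while True:
        user_input = input("you> ")
        print(chat(user_input))
```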
Incidentally, you can achieve the same thing in the Transformers ecosystem using TGI (Text Generation Inference). It’s also possible with vLLM.
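If you’d rather try vLLM directly from Python, a minimal offline-inference sketch looks like the one below. Keep in mind that a 7B fp16 model won’t fit in 8 GB of VRAM, so this assumes a quantized (e.g. AWQ) or smaller checkpoint; the model id is a placeholder:

```python
# Minimal sketch of offline inference with vLLM's Python API.
from vllm import LLM, SamplingParams

# Placeholder model id: use a quantized or small checkpoint that fits 8 GB VRAM.
llm = LLM(model="your-quantized-r1-checkpoint", gpu_memory_utilization=0.9)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Hello, who are you?"], params)
print(outputs[0].outputs[0].text)
```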