Hello everyone, are there any best practices for running an LLM with the llama.cpp server? I mean specific parameters that should be set when loading a model, regardless of its size. I'm trying to use TheBloke/Mixtral-8x7B-v0.1-GGUF, but it's quite large and sometimes it doesn't return any answer at all. The Ollama server, which also lets you pull models directly from the Ollama website, does a really good job of configuring the model automatically. Do you have any ideas on how to tune the llama.cpp server for TheBloke/Mixtral-8x7B-v0.1-GGUF? Any advice would be appreciated. Thank you!
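For reference, this is roughly how I'm launching it now. The flags are the documented llama.cpp server options; the quant file name is the Q4_K_M file from TheBloke's repo, and the `-c`/`-ngl`/`-t` values are just what I've been experimenting with, not anything I know to be correct:

```bash
# Sketch of my current launch command (assumes a GPU-enabled llama.cpp build).
#   -m    path to the GGUF file (Q4_K_M of Mixtral 8x7B is roughly 26 GB)
#   -c    context window; smaller values use less memory
#   -ngl  number of layers offloaded to the GPU (0 = pure CPU)
#   -t    CPU threads for the layers that stay on the CPU
./llama-server \
  -m ./mixtral-8x7b-v0.1.Q4_K_M.gguf \
  -c 4096 \
  -ngl 20 \
  -t 8 \
  --host 127.0.0.1 --port 8080
```

I then test it against the server's `/completion` endpoint, e.g.:

```bash
curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain mixture-of-experts in one sentence.", "n_predict": 128}'
```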