I'm currently using llama.cpp with CUDA.
I load Meta-Llama-3.1-8B-Instruct-Q6_K.gguf from Hugging Face, but it isn't running on the GPU.
Do I have to rebuild it for the GPU? How can I do that?
I think a GGUF file itself can be loaded onto the GPU without any problems. It is more likely that llama.cpp was built without GPU support; llama.cpp is difficult to build correctly with GPU support enabled…
It is safer to use a pre-built version.
If there is something wrong with the GGUF file itself, it is quicker to just download it again.
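If you do want to build from source, this is roughly what a CUDA-enabled build looks like. A minimal sketch, assuming a recent llama.cpp checkout with CMake and the CUDA toolkit installed; the CMake flag has changed name across versions (older releases used `LLAMA_CUBLAS`), so check the build docs for the version you have:

```bash
# Build llama.cpp with CUDA support (sketch; flag names vary by version)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# GGML_CUDA is the flag used by recent versions; older releases used -DLLAMA_CUBLAS=on
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Binaries such as llama-cli and llama-server end up under build/bin
ls build/bin
```

Pre-built CUDA binaries for some platforms are also published on the llama.cpp GitHub releases page, which is what the "use the pre-built version" advice above refers to.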
I built llama.cpp with CUDA enabled and load [bartowski/Meta-Llama-3.1-8B-Instruct-GGUF](https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF) from Hugging Face.
So I launch the server and it works, but it doesn't use the GPU.
In that case, it is possible that there is not enough VRAM, or that parameters such as n_gpu_layers and n_ctx are not set appropriately. For information on settings, please refer to the issue quoted below (a short sketch follows it). If you try a very small GGUF, you should be able to tell whether the problem is VRAM or not.
Also, even if you build with CUDA specified, there are many cases where CUDA is not actually enabled.
GitHub issue (opened 24 Jan 2024, labels: bug, performance):
I installed llama-cpp-python using the instructions below:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
The speed:
llama_print_timings: eval time = 81.91 ms / 2 runs ( 40.95 ms per token, 1.02 tokens per second)
I installed llama-cpp-python using the instructions below:
pip install llama-cpp-python
The speed:
llama_print_timings: eval time = 81.91 ms / 2 runs ( 40.95 ms per token, 30.01 tokens per second)
My code is as follows:
result = self.model(
    prompt,               # Prompt
    # max_tokens=nt,      # Generate up to 32 tokens
    # stop=["Q:", "\n"],  # Stop generating just before the model would generate a new question
    # echo=True           # Echo the prompt back in the output
)
So how can I use the GPU to speed this up?
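As a concrete illustration of the advice above: for llama-cpp-python, the usual fix is to force a reinstall with the CUDA backend enabled and then request layer offload from the model. A minimal sketch, assuming a recent llama-cpp-python and the CUDA toolkit installed; the CMake flag has changed name across versions (older builds used `LLAMA_CUBLAS`), so verify it against your version:

```bash
# Rebuild llama-cpp-python with the CUDA backend (sketch; flag name varies by version)
CMAKE_ARGS="-DGGML_CUDA=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python

# While generating, check whether the model weights actually landed in VRAM
nvidia-smi
```

In Python code, the offload is then requested via the n_gpu_layers argument of the Llama constructor (for example n_gpu_layers=-1 to offload all layers), together with an n_ctx that fits in VRAM.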
Yeah, I did that, and I use llama-server. I think it requires additional options for CUDA, but I don't know exactly which ones!
I've never used it in server mode, but it seems you can specify options using the method below (a hedged launch example follows the quoted README). I think the effect is the same.
# LLaMA.cpp HTTP Server
Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.
Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
**Features:**
* LLM inference of F16 and quantized models on GPU and CPU
* [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
* Reranking endpoint (WIP: https://github.com/ggerganov/llama.cpp/pull/9510)
* Parallel decoding with multi-user support
* Continuous batching
* Multimodal (wip)
* Monitoring endpoints
* Schema-constrained JSON response format
The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216).
## Usage
(README truncated here.)
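For the GPU question specifically, the relevant server option is the number of layers to offload. A minimal sketch of a llama-server launch, assuming a CUDA-enabled build; the model path and port below are placeholders, and flag spellings can differ slightly between versions, so check `llama-server --help`:

```bash
# Launch llama-server with GPU offload (sketch; adjust paths and sizes to your setup).
# -ngl / --n-gpu-layers : number of layers to offload to the GPU (lower it if VRAM runs out)
# -c   / --ctx-size     : context size; larger contexts also need more VRAM
./build/bin/llama-server \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf \
  -ngl 99 \
  -c 4096 \
  --host 127.0.0.1 --port 8080
```

If the startup log reports layers being offloaded to CUDA (and nvidia-smi shows VRAM in use while generating), the offload is working; if zero layers are offloaded, the binary was most likely built without CUDA after all.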
system
Closed December 19, 2024, 4:07am
This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.