I'm currently using llama.cpp with CUDA.
I load Meta-Llama-3.1-8B-Instruct-Q6_K.gguf from Hugging Face, but it isn't running on the GPU.
Do I have to rebuild it for the GPU? How can I do that?
I think a GGUF file itself can be loaded onto the GPU without any problems. It is more likely that llama.cpp was built without GPU support; llama.cpp is difficult to build correctly with GPU support enabled…
It is safer to use a pre-built version.
If there is something wrong with the GGUF file itself, it is quicker to just download it again.
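If you do want to build from source, this is roughly what a CUDA-enabled build looks like. A minimal sketch, assuming a recent llama.cpp checkout with CMake and the CUDA toolkit installed; the CMake flag has changed name across versions (older releases used `LLAMA_CUBLAS`), so check the build docs for the version you have:

```bash
# Build llama.cpp with CUDA support (sketch; flag names vary by version)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# GGML_CUDA is the flag used by recent versions; older releases used -DLLAMA_CUBLAS=on
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Binaries such as llama-cli and llama-server end up under build/bin
ls build/bin
```

Pre-built CUDA binaries for some platforms are also published on the llama.cpp GitHub releases page, which is what the "use the pre-built version" advice above refers to.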
I built llama.cpp with CUDA enabled and load [bartowski/Meta-Llama-3.1-8B-Instruct-GGUF](https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF) from Hugging Face.
So I launch the server and it works, but it doesn't use the GPU.
In that case, it is possible that there is not enough VRAM, or that parameters such as n_gpu_layers and n_ctx are not set appropriately. For information on settings, please refer to the issue quoted below (a short sketch follows it). If you try a very small GGUF, you should be able to tell whether the problem is VRAM or not.
Also, even if you build with CUDA specified, there are many cases where CUDA is not actually enabled.
GitHub issue (opened 24 Jan 2024, labels: bug, performance):
I installed llama-cpp-python using the instructions below:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
The speed:
llama_print_timings: eval time = 81.91 ms / 2 runs ( 40.95 ms per token, 1.02 tokens per second)
I installed llama-cpp-python using the instructions below:
pip install llama-cpp-python
The speed:
llama_print_timings: eval time = 81.91 ms / 2 runs ( 40.95 ms per token, 30.01 tokens per second)
My code is as follows:
result = self.model(
    prompt,               # Prompt
    # max_tokens=nt,      # Generate up to 32 tokens
    # stop=["Q:", "\n"],  # Stop generating just before the model would generate a new question
    # echo=True           # Echo the prompt back in the output
)
So how can I use the GPU to speed this up?
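As a concrete illustration of the advice above: for llama-cpp-python, the usual fix is to force a reinstall with the CUDA backend enabled and then request layer offload from the model. A minimal sketch, assuming a recent llama-cpp-python and the CUDA toolkit installed; the CMake flag has changed name across versions (older builds used `LLAMA_CUBLAS`), so verify it against your version:

```bash
# Rebuild llama-cpp-python with the CUDA backend (sketch; flag name varies by version)
CMAKE_ARGS="-DGGML_CUDA=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python

# While generating, check whether the model weights actually landed in VRAM
nvidia-smi
```

In Python code, the offload is then requested via the n_gpu_layers argument of the Llama constructor (for example n_gpu_layers=-1 to offload all layers), together with an n_ctx that fits in VRAM.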
Yeah, I did that, and I use llama-server. I think it requires additional options for CUDA, but I don't know exactly which ones!
I've never used it in server mode, but it seems you can specify options using the method below (a hedged launch example follows the quoted README). I think the effect is the same.
# LLaMA.cpp HTTP Server
Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.
Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
**Features:**
* LLM inference of F16 and quantized models on GPU and CPU
* [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
* Reranking endpoint (WIP: https://github.com/ggerganov/llama.cpp/pull/9510)
* Parallel decoding with multi-user support
* Continuous batching
* Multimodal (wip)
* Monitoring endpoints
* Schema-constrained JSON response format
The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216).
## Usage
(README truncated here.)
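For the GPU question specifically, the relevant server option is the number of layers to offload. A minimal sketch of a llama-server launch, assuming a CUDA-enabled build; the model path and port below are placeholders, and flag spellings can differ slightly between versions, so check `llama-server --help`:

```bash
# Launch llama-server with GPU offload (sketch; adjust paths and sizes to your setup).
# -ngl / --n-gpu-layers : number of layers to offload to the GPU (lower it if VRAM runs out)
# -c   / --ctx-size     : context size; larger contexts also need more VRAM
./build/bin/llama-server \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf \
  -ngl 99 \
  -c 4096 \
  --host 127.0.0.1 --port 8080
```

If the startup log reports layers being offloaded to CUDA (and nvidia-smi shows VRAM in use while generating), the offload is working; if zero layers are offloaded, the binary was most likely built without CUDA after all.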
system
Closed December 19, 2024, 4:07am
This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.