About using llama-cpp-python

I would like to build a chat demo of a quantized model using llama-cpp-python.

When I run my Python code on the free cpu-basic hardware, chatting works fine.

However, when I switch to the cpu-upgrade hardware ($0.03/h), it stops working.

Specifically, when I try to load the model with llama-cpp-python, the call never returns: there is no response after the function call, and none of the subsequent code is executed.
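
For reference, here is a minimal sketch of the loading code I am using. The model path and parameter values below are placeholders, not my exact settings:

```python
from llama_cpp import Llama

# Placeholder path to a quantized GGUF model file.
llm = Llama(
    model_path="./model.Q4_K_M.gguf",
    n_ctx=2048,
    verbose=True,  # llama.cpp prints load progress, so I can see where it stalls
)

# On cpu-upgrade, execution never reaches this line; the constructor above hangs.
print("model loaded")
```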

My guess is that when the CPU cores or memory are increased, some setting on my side is insufficient, or there is a machine constraint I am not aware of; one thing I wonder about is the thread configuration, as in the sketch below.
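
For example, would it make a difference to pin the thread count explicitly instead of letting llama.cpp autodetect it? (`n_threads` is a llama-cpp-python constructor parameter; the value here is just an illustration.)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,
    # Fix the thread count to a small value rather than relying on
    # the autodetected core count of the upgraded hardware.
    n_threads=2,
)
```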

If you have any solutions or suggestions, could you please share them?