Correct way to pass context to llama.cpp server

We are running the llama.cpp server locally with DeepSeek-V2 using the following parameters.

# Run the server command
$BUILD_DIR/llama-server \
  -m $MODEL_PATH \
  --ctx-size 8192 \
  --parallel 1 \
  --n-gpu-layers -1 \
  --port 52555 \
  --threads $num_cores \
  --color \
  --metrics \
  --batch-size 1024 \
  --numa isolate \
  --mlock \
  --no-mmap \
  --grp-attn-n 2 \
  --grp-attn-w 512.0 \
  --defrag-thold 0.2 \
  --cont-batching \
  --cont-batching \
  --rope-scaling linear \
  --rope-scale 2 \
  --yarn-orig-ctx 2048 \
  --yarn-ext-factor 0.5 \
  --yarn-attn-factor 1.0 \
  --yarn-beta-slow 1.0 \
  --yarn-beta-fast 32.0 \
  --embeddings

This setup comfortably achieves 40 tokens/second on an Apple M1 machine for shorter prompts. However, issues arise when large contexts are included with the prompt, which matters because we are building a GitHub Copilot-style code assistant.

In the indexing phase, we tokenize the entire repository, break it into 256-token chunks, generate embeddings with all-MiniLM-L6-v2, build a Faiss index, and store it on disk. When a user sends a prompt while working on a file in the repository, we load the index and use Faiss to retrieve the relevant code chunks. This retrieval happens in microseconds and works without any issues.
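For reference, here is a minimal sketch of that indexing step (the chunking helper, file handling, and index filename below are simplified for this post, not our exact code):

# Minimal indexing sketch: split the repo into ~256-token chunks,
# embed them with all-MiniLM-L6-v2, and persist a flat Faiss index.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text, tokenizer, chunk_tokens=256):
    # Fixed-size token windows, decoded back to text.
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [tokenizer.decode(ids[i:i + chunk_tokens])
            for i in range(0, len(ids), chunk_tokens)]

def build_index(files):
    chunks = []
    for path in files:
        with open(path, encoding="utf-8", errors="ignore") as f:
            chunks.extend(chunk_text(f.read(), embedder.tokenizer))
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])  # cosine similarity via inner product on normalized vectors
    index.add(np.asarray(vectors, dtype="float32"))
    faiss.write_index(index, "repo.index")
    return index, chunks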

However, when we retrieve more than one relevant chunk (k > 2), the server response time increases significantly and the results are often irrelevant.

The problem escalates when we pass this retrieved context along with the user prompt to the server: the request either hangs or the response breaks.

Here’s an example of how we construct the prompt:

DEEPSEEK_V2_PROMPT_TEMPLATE = """
<|begin of sentence|>{system_prompt}
User: {prompt}
Assistant: <|end of sentence|>Assistant:
"""
system_prompt = """
Answer the user query using the context.
Must ensure all code snippets are properly formatted and enclosed within \"\"\" \"\"\" in your response.
"""
prompt = f"Generate code snippet: {user_prompt} considering Context: {results}"
print(prompt)

full_prompt = DEEPSEEK_V2_PROMPT_TEMPLATE.format(system_prompt=system_prompt, prompt=prompt)

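For context, `results` above is just the top-k Faiss hits joined into one string; roughly like this (reusing the `embedder`, `index`, and `chunks` names from the indexing sketch above, simplified):

# Roughly how `results` is assembled from the Faiss search
# (embedder/index/chunks come from the indexing sketch above; simplified).
def retrieve_context(query, index, chunks, k=2):
    query_vec = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(query_vec, dtype="float32"), k)
    # Concatenate the top-k chunks into a single context string.
    return "\n\n".join(chunks[i] for i in ids[0])

results = retrieve_context(user_prompt, index, chunks, k=2)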
We then stream the LLM server’s response and forward it with a StreamingResponse:

import json
import requests

def stream_result(full_prompt):
    # POST the prompt to the llama.cpp server and stream the reply back.
    return requests.post(
        f"{llm_server_url}/completions",
        data=json.dumps({
            "prompt": full_prompt,
            "stream": True,
            "temperature": float(temperature),
            "cache_prompt": True
        }),
        headers={"Content-Type": "application/json"},
        stream=True,
        timeout=500  # give long prompts up to 500 seconds
    )
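On the API side, the StreamingResponse endpoint looks roughly like this (the FastAPI app, route name, and SSE parsing are simplified assumptions for this post, not our exact service):

# Minimal FastAPI endpoint that forwards the llama.cpp token stream to the client.
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/generate")
def generate(user_prompt: str):
    context = retrieve_context(user_prompt, index, chunks, k=2)
    prompt = f"Generate code snippet: {user_prompt} considering Context: {context}"
    full_prompt = DEEPSEEK_V2_PROMPT_TEMPLATE.format(system_prompt=system_prompt, prompt=prompt)

    def token_stream():
        with stream_result(full_prompt) as resp:
            # llama.cpp streams server-sent events: lines of the form `data: {...}`
            for line in resp.iter_lines():
                if line and line.startswith(b"data: "):
                    payload = json.loads(line[len(b"data: "):])
                    yield payload.get("content", "")

    return StreamingResponse(token_stream(), media_type="text/plain")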

Is there a better approach to speed up inference, or is this method fundamentally flawed for passing context to the llama.cpp server?
Is there any alternative way to pass a huge context to the llama.cpp server, or should we instead try QLoRA?


In these server parameters, --cont-batching appears twice. Is this a typo, or does it mean something?