Correct way to pass context to llama.cpp server

We are running the llama.cpp server locally with DeepSeek-V2 using the following parameters.

# Run the server command
$BUILD_DIR/llama-server \
  -m $MODEL_PATH \
  --ctx-size 8192 \
  --parallel 1 \
  --n-gpu-layers -1 \
  --port 52555 \
  --threads $num_cores \
  --color \
  --metrics \
  --batch-size 1024 \
  --numa isolate \
  --mlock \
  --no-mmap \
  --grp-attn-n 2 \
  --grp-attn-w 512.0 \
  --defrag-thold 0.2 \
  --cont-batching \
  --cont-batching \
  --rope-scaling linear \
  --rope-scale 2 \
  --yarn-orig-ctx 2048 \
  --yarn-ext-factor 0.5 \
  --yarn-attn-factor 1.0 \
  --yarn-beta-slow 1.0 \
  --yarn-beta-fast 32.0 \
  --embeddings

This setup comfortably achieves 40 tokens/second on an Apple M1 machine for shorter prompts. However, issues arise when large contexts are included with the prompt, which matters because we are building a GitHub Copilot-style code assistant.

In the indexing phase, we tokenize the entire repository, break it into 256-token chunks, generate embeddings with all-MiniLM-L6-v2, build a Faiss index, and store it on disk. When a user sends a prompt while working on a file in the repository, we load the index and use Faiss to retrieve the relevant code chunks. This retrieval happens in microseconds and works without any issues.
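For reference, here is a minimal sketch of that indexing step (the chunking helper, file handling, and index filename below are simplified for this post, not our exact code):

# Minimal indexing sketch: split the repo into ~256-token chunks,
# embed them with all-MiniLM-L6-v2, and persist a flat Faiss index.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text, tokenizer, chunk_tokens=256):
    # Fixed-size token windows, decoded back to text.
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [tokenizer.decode(ids[i:i + chunk_tokens])
            for i in range(0, len(ids), chunk_tokens)]

def build_index(files):
    chunks = []
    for path in files:
        with open(path, encoding="utf-8", errors="ignore") as f:
            chunks.extend(chunk_text(f.read(), embedder.tokenizer))
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])  # cosine similarity via inner product on normalized vectors
    index.add(np.asarray(vectors, dtype="float32"))
    faiss.write_index(index, "repo.index")
    return index, chunks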

However, when we retrieve more than one relevant chunk (k > 2), the server response time increases significantly and the results are often irrelevant.

The problem escalates when we pass this retrieved context along with the user prompt to the server: the request either hangs or the response breaks.

Here’s an example of how we construct the prompt:

DEEPSEEK_V2_PROMPT_TEMPLATE = """
<|begin of sentence|>{system_prompt}
User: {prompt}
Assistant: <|end of sentence|>Assistant:
"""
system_prompt = """
Answer the user query using the context.
Must ensure all code snippets are properly formatted and enclosed within \"\"\" \"\"\" in your response.
"""
prompt = f"Generate code snippet: {user_prompt} considering Context: {results}"
print(prompt)

full_prompt = DEEPSEEK_V2_PROMPT_TEMPLATE.format(system_prompt=system_prompt, prompt=prompt)

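For context, `results` above is just the top-k Faiss hits joined into one string; roughly like this (reusing the `embedder`, `index`, and `chunks` names from the indexing sketch above, simplified):

# Roughly how `results` is assembled from the Faiss search
# (embedder/index/chunks come from the indexing sketch above; simplified).
def retrieve_context(query, index, chunks, k=2):
    query_vec = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(query_vec, dtype="float32"), k)
    # Concatenate the top-k chunks into a single context string.
    return "\n\n".join(chunks[i] for i in ids[0])

results = retrieve_context(user_prompt, index, chunks, k=2)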
We then stream the LLM server’s response and forward it with a StreamingResponse:

import json
import requests

def stream_result(full_prompt):
    # POST the prompt to the llama.cpp server and stream the reply back.
    return requests.post(
        f"{llm_server_url}/completions",
        data=json.dumps({
            "prompt": full_prompt,
            "stream": True,
            "temperature": float(temperature),
            "cache_prompt": True
        }),
        headers={"Content-Type": "application/json"},
        stream=True,
        timeout=500  # give long prompts up to 500 seconds
    )
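On the API side, the StreamingResponse endpoint looks roughly like this (the FastAPI app, route name, and SSE parsing are simplified assumptions for this post, not our exact service):

# Minimal FastAPI endpoint that forwards the llama.cpp token stream to the client.
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/generate")
def generate(user_prompt: str):
    context = retrieve_context(user_prompt, index, chunks, k=2)
    prompt = f"Generate code snippet: {user_prompt} considering Context: {context}"
    full_prompt = DEEPSEEK_V2_PROMPT_TEMPLATE.format(system_prompt=system_prompt, prompt=prompt)

    def token_stream():
        with stream_result(full_prompt) as resp:
            # llama.cpp streams server-sent events: lines of the form `data: {...}`
            for line in resp.iter_lines():
                if line and line.startswith(b"data: "):
                    payload = json.loads(line[len(b"data: "):])
                    yield payload.get("content", "")

    return StreamingResponse(token_stream(), media_type="text/plain")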

Is there a better approach to speed up inference, or is this method fundamentally flawed for passing context to the llama.cpp server?
Is there any alternative way to pass a huge context to the llama.cpp server, or should we instead try QLoRA?


In these server parameters, --cont-batching appears twice. Is this a typo, or does it mean something?