Using a Hugging Face Model offline to support code generation in VSCode

I am trying to use a Hugging Face model offline, served at http://localhost:11434/api/generate (spoofing Ollama), with VSCode. After I get success there, I might try it with openclaw.

I am unable to get VSCode to access the model.

I have tried Continue, LM Studio, CodeGPT and AI Tools in VSCode. Each time I run into a wall of non-functionality, or a demand that I log in via Google or something; I want no logins. With AI Tools, it tried once to access /api/tags, so I looked up what Ollama returns for tags and wrote code to spoof that in my API. I just want VSCode to send a prompt to my (spoofed Ollama) interface, wait for the dictionary with a "response" field, and use it when it arrives.

I am a beginner in AI/Python/VSCode, not a beginner in a lot of “old” languages. I have:

  • Downloaded example code for use of Qwen-2.5 Coder 3B offline on my GPU (6GB)
  • Used the LLM model to help me learn Python to expand my code into a chatbot (copy&paste from chatbot to VSCode, tinker, debug, expand again…)
  • Used the model to learn more Python in two further expansions.
  • Developed code to run the model behind a (spoofed Ollama) interface using uvicorn, tested it with curl, and created an /api/chatbot interface too and used it there. I can tell from the connections printed to my uvicorn server's terminal window when there is an attempt to contact the local LLM.

I want to use Hugging Face, not Ollama. I want completely private sessions: no tokens, no tracking, no telemetry, no logins. I have achieved that with Hugging Face. The model I chose just works with my card; I will try others later.

If VSCode (OpenClaw) is just intentionally incompatible with Hugging Face, fine; a link to an explanation why would be appreciated.

If this can be made to work, please provide a link to the clearest explanation of how.

Thank you


It should work fine if you use the extension… apparently:


This can be made to work. The cleanest route for your exact requirements is:

Use Continue with a local config.yaml, and point it at either

  1. a real OpenAI-compatible /v1 endpoint, or
  2. a fuller Ollama-compatible shim than just /api/generate. (Continue Docs)

The short reason is simple:

VS Code extensions do not care that the weights came from Hugging Face. They care about the HTTP protocol they are talking to. Continue explicitly supports both an Ollama provider with custom apiBase and an OpenAI-compatible provider with custom apiBase. AI Toolkit also supports custom Ollama endpoints and custom OpenAI-compatible endpoints. (Continue Docs)

What is probably happening in your case

Your server works for the one flow you tested with curl. The extension is likely trying more than that.

That is not speculation in the abstract. Continue’s Ollama implementation calls multiple endpoints, including GET /api/tags, POST /api/show, POST /api/chat, and POST /api/generate. There are also real Continue issues from users who could reach their server manually but then saw Continue request /api/show or /api/chat and fail. (GitHub)

So this part matters:

/api/generate alone is usually not enough.

If you only spoofed /api/generate, and later added /api/tags, that still leaves a gap for tools that probe /api/show and /api/chat. That fits your symptoms very well. (Ollama Documentation)
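
One quick way to see exactly which endpoints an extension probes is to run a catch-all route that logs every unmatched request and returns 404. This is a minimal sketch, assuming a FastAPI/uvicorn setup like yours; the port and route handling are illustrative, not a prescribed design.

# Minimal sketch: log every request path an extension sends to the shim,
# so you can see which Ollama endpoints it expects beyond /api/generate.
# Assumes FastAPI/uvicorn as described above; the port is illustrative.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import uvicorn

app = FastAPI()

@app.api_route("/{path:path}", methods=["GET", "POST"])
async def log_everything(path: str, request: Request):
    body = await request.body()
    print(f"{request.method} /{path}  body={body[:200]!r}")
    # Return 404 so the client reports which call failed.
    return JSONResponse({"error": f"unhandled endpoint /{path}"}, status_code=404)

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=11434)

Point the extension at this stub once, read the printed paths, then implement the endpoints it actually asked for.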

The background that makes this confusing

Inside VS Code, “AI coding” is not one feature. It is usually several:

  • chat
  • edit/apply
  • inline completion
  • indexing/embeddings/model discovery

Different tools use different endpoints for those. Continue’s OpenAI-compatible provider docs even mention forcing legacy completions usage, which is a clue that not every feature goes through the same route. Continue also documents separate model roles and separate autocomplete setup. (Continue Docs)

That is why “my server answers a prompt and returns response” is necessary but not sufficient.

The strongest recommendation for your setup

Use Continue first

Continue is the best match to what you asked for because it has:

  • a documented offline / air-gapped guide
  • documented local config
  • explicit support for Ollama
  • explicit support for OpenAI-compatible providers via apiBase (Continue Docs)

Those are the clearest official explanations I found for “use a local model in an IDE without cloud dependency.” (Continue Docs)

Do not start with VS Code built-in chat

VS Code’s own docs say that when you use bring-your-own models for chat, the Copilot service API is still used for some tasks such as embeddings, repository indexing, query refinement, intent detection, and side queries. There are also issue reports explicitly asking for local models to work without GitHub login and completely offline, which means your complaint is shared by other users and is not solved by default. (Visual Studio Code)

So for your requirement of no login, no tracking, no tokens, no telemetry, VS Code’s built-in path is the wrong first target. (Visual Studio Code)

Why “Hugging Face, not Ollama” is the wrong dividing line

This is the key conceptual point.

“Hugging Face” is where your model and tooling come from. “Ollama” or “OpenAI-compatible” is the wire protocol your editor is speaking.

A Hugging Face model can sit behind:

  • your own FastAPI wrapper
  • TGI
  • vLLM
  • another OpenAI-compatible server
  • an Ollama-like shim

The editor only sees the API. It does not know or care whether the weights originally came from Hugging Face. Continue’s OpenAI docs explicitly describe connecting to OpenAI-compatible providers via apiBase. AI Toolkit explicitly supports adding custom models with an OpenAI-compatible endpoint, and also custom Ollama endpoints. (Continue Docs)

So no, VS Code and Continue are not “intentionally incompatible with Hugging Face.” The real compatibility boundary is protocol shape, not model origin. (Continue Docs)

The two viable designs

Design A. Keep your current Ollama-style shim

This is the quickest path if you want to reuse your work.

But then implement a more complete Ollama subset:

  • GET /api/tags
  • POST /api/show
  • POST /api/chat
  • POST /api/generate

Those are all part of Ollama’s documented API surface, and they are the same paths Continue users have reported seeing in practice. (Ollama Documentation)

The official Ollama API docs list generate, chat, embeddings, list models, and show model details. That matches the shape tools tend to expect. (Ollama Documentation)
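
As a concrete sketch of Design A, here is roughly what a fuller shim in front of a Hugging Face model could look like. This is a hedged outline, not Ollama's exact wire format: the field names shown (models, message, done, and so on) follow the shapes described in Ollama's API docs but should be checked against those docs and against what your extension actually sends. Note also that the real Ollama API streams newline-delimited JSON by default; whether your client accepts a single non-streaming JSON reply is something to verify. The run_model helper is a placeholder for your existing generation code.

# Sketch of a fuller Ollama-style shim (Design A). Field names follow the
# shapes in Ollama's API docs; verify against the docs and against what your
# extension actually sends. run_model() stands in for your Hugging Face code.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
MODEL_NAME = "qwen2.5-coder:3b"  # illustrative name

def run_model(prompt: str) -> str:
    # Placeholder: call your existing transformers generate() code here.
    return "model output for: " + prompt

@app.get("/api/tags")
def tags():
    # Model listing; extensions use this to discover available models.
    return {"models": [{"name": MODEL_NAME, "model": MODEL_NAME}]}

@app.post("/api/show")
def show(body: dict):
    # Model metadata; a minimal stub is often enough for a first test.
    return {"modelfile": "", "parameters": "", "template": "", "details": {}}

class GenerateRequest(BaseModel):
    model: str = MODEL_NAME
    prompt: str = ""

@app.post("/api/generate")
def generate(req: GenerateRequest):
    return {"model": req.model, "response": run_model(req.prompt), "done": True}

@app.post("/api/chat")
def chat(body: dict):
    # Flatten chat messages into one prompt; real code may handle roles better.
    prompt = "\n".join(m.get("content", "") for m in body.get("messages", []))
    return {"model": body.get("model", MODEL_NAME),
            "message": {"role": "assistant", "content": run_model(prompt)},
            "done": True}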

Design B. Switch to an OpenAI-compatible /v1 endpoint

This is the cleaner long-term design.

Continue documents using provider: openai with a custom apiBase. AI Toolkit also documents adding a self-hosted or local model with an OpenAI-compatible endpoint. (Continue Docs)

For editor tooling, this is often easier to reuse across tools than a custom fake-Ollama server.

My view: Design B is better long-term. Design A is faster if you are already close.

The trap with OpenAI-compatible mode

Do not assume POST /v1/chat/completions is enough.

Continue’s docs mention legacy completions handling, and real user reports show cases where chat worked differently from edit/autocomplete because different endpoints were used. That means a backend that only supports chat-style calls may still fail in coding workflows. (Continue Docs)

So if you go OpenAI-compatible, expect to support at least the endpoints your chosen extension actually uses, not just the one you wish it used. (Continue Docs)
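
For example, many OpenAI-compatible clients also call GET /v1/models to discover model names, and some fall back to the legacy POST /v1/completions endpoint for autocomplete-style requests. A hedged sketch of those two extra routes, to sit alongside /v1/chat/completions, could look like this; the response fields follow the public OpenAI API shapes, and run_model is a placeholder for your generation code. Check what your chosen extension actually requests before treating this as complete.

# Sketch: extra OpenAI-style routes some clients probe besides /v1/chat/completions.
# Field names follow the public OpenAI API shapes; run_model() is a placeholder.
import time
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
MODEL_ID = "qwen2.5-coder-3b"  # illustrative

def run_model(prompt: str) -> str:
    return "model output for: " + prompt  # replace with your generate() code

@app.get("/v1/models")
def list_models():
    # Model discovery; some clients call this before sending any prompt.
    return {"object": "list",
            "data": [{"id": MODEL_ID, "object": "model",
                      "created": int(time.time()), "owned_by": "local"}]}

class CompletionRequest(BaseModel):
    model: str = MODEL_ID
    prompt: str = ""
    max_tokens: int = 64

@app.post("/v1/completions")
def completions(req: CompletionRequest):
    # Legacy completions endpoint; some autocomplete paths use this shape.
    return {"id": "cmpl-local", "object": "text_completion",
            "created": int(time.time()), "model": req.model,
            "choices": [{"text": run_model(req.prompt), "index": 0,
                         "finish_reason": "stop"}]}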

The clearest explanation of how to do it

The clearest official docs I found, in order, are:

  1. Continue: How to Run Continue Without Internet
    Best overall explanation for your privacy goal. It covers offline setup, local providers, and disabling telemetry. (Continue Docs)

  2. Continue: How to Understand Hub vs Local Configuration
    Best explanation of why local config.yaml is the right path for an offline or restricted setup. (Continue Docs)

  3. Continue: How to Configure OpenAI Models with Continue
    Best explanation if you want to expose your Hugging Face model through a custom /v1 server. (Continue Docs)

  4. Continue: How to Configure Ollama with Continue
    Best explanation if you want to keep your current “spoof Ollama” idea. (Continue Docs)

  5. Ollama API introduction
    Best reference for which /api/... endpoints an Ollama-style server normally exposes. (Ollama Documentation)

  6. AI Toolkit model docs
    Useful mainly to confirm that custom Ollama endpoints and OpenAI-compatible endpoints are officially supported concepts. (Visual Studio Code)

What I would do if I were solving your exact problem

I would do this in order.

Step 1. Stop testing multiple VS Code AI extensions at once

Pick Continue first. It has the clearest docs for offline local use, and you can fully control the config locally. (Continue Docs)

Step 2. Decide whether you want the fastest win or the cleanest architecture

If you want the fastest win, keep your current server and make it answer the endpoint set from Design A: GET /api/tags, POST /api/show, POST /api/chat, and POST /api/generate.

If you want the cleanest architecture, expose an OpenAI-compatible /v1 API and point Continue’s provider: openai at it. (Continue Docs)

Step 3. Use local Continue config

Continue documents local config as machine-local, offline-capable, and suitable for strict data policies. That matches your stated goal exactly. (Continue Docs)

A minimal shape looks like this:

name: Local Config
version: 1.0.0
schema: v1

models:
  - name: Local HF via OpenAI API
    provider: openai
    model: qwen2.5-coder-3b
    apiBase: http://127.0.0.1:8000/v1

That pattern follows Continue’s documented OpenAI-compatible configuration. (Continue Docs)
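
Before pointing Continue at that apiBase, it can save time to confirm the /v1 endpoint answers a chat completion on its own. A short test script, assuming your server is listening on 127.0.0.1:8000 as in the config above:

# Quick sanity check of a local OpenAI-compatible server before wiring up Continue.
# Assumes the server from the config above is listening on 127.0.0.1:8000.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "qwen2.5-coder-3b",
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "max_tokens": 128,
    },
    timeout=300,  # local small-GPU models can be slow; allow plenty of time
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])

If this works but the extension still fails, the problem is in which endpoints or fields the extension expects, not in your model.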

Or, if you keep the Ollama-style shim:

name: Local Config
version: 1.0.0
schema: v1

models:
  - name: Local HF via Ollama Shim
    provider: ollama
    model: qwen2.5-coder:3b
    apiBase: http://127.0.0.1:11434

That pattern follows Continue’s documented Ollama configuration. (Continue Docs)

Step 4. Disable everything nonessential for the first test

Do not try to solve chat, edit, autocomplete, indexing, and agents all at once.

First get one prompt-response loop working inside Continue chat. Then add edit. Then test inline completion. Continue’s docs and config model support this incremental approach. (Continue Docs)

About OpenClaw

OpenClaw is not intentionally incompatible with local Hugging Face models. Its current docs explicitly describe two local paths:

  • native Ollama integration using /api/chat
  • OpenAI-compatible local servers such as vLLM (OpenClaw)

So the answer is not “OpenClaw rejects Hugging Face.”

But I would still not use OpenClaw as your next step. Why:

  • its docs are aimed at a broader agent stack, not the simplest VS Code coding-assistant setup
  • there are recent issues around custom local providers, baseUrl, and provider routing (GitHub)

So OpenClaw may become viable later, but it is a worse first target than Continue for your current goal. (OpenClaw)

My bottom-line judgment

For your case:

  • Your privacy requirement is reasonable.
  • Your Hugging Face local model choice is not the blocker.
  • Your current fake-Ollama endpoint is probably too incomplete for the extension you are testing.
  • Continue is the best first extension to target.
  • VS Code built-in Copilot/BYOK is not a good fit for strict no-login offline use.
  • OpenClaw is not intentionally incompatible, but it is the wrong next battle. (Continue Docs)

The cleanest practical path is:

Continue + local config + either

  • a more complete Ollama-style shim, or
  • a proper OpenAI-compatible /v1 server in front of your Hugging Face model. (Continue Docs)

Hi there,

Thanks for the detailed explanation; it really helps clarify exactly what you’re trying to do. From what you’ve described, it sounds like your setup is solid on the Hugging Face offline model side, and the main challenge is getting VSCode extensions to talk to your local API without requiring logins or external accounts.

Here’s a breakdown of what’s happening and some guidance:

1. Why VSCode extensions are giving login prompts

Most VSCode AI extensions (Continue, LM Studio, CodeGPT, AI Tools, OpenClaw) are designed to work with cloud-hosted AI services, which usually require:

  • API keys / OAuth logins (Google, GitHub, Hugging Face tokens)

  • Specific API endpoints with expected authentication

If you try to point them at a local model, they often still try to reach the cloud or validate a token, which is why you’re hitting walls or errors.

Some extensions, like AI Tools, let you specify a custom endpoint, but they often expect the endpoint to return specific JSON formats (tags, metadata, etc.), which is why you had to spoof /api/tags.

2. Using a local Hugging Face model with VSCode

Since you want fully offline, private sessions, the cleanest approach is usually not to rely on prebuilt VSCode extensions. Instead, you can:

  • Run your local API (uvicorn server) as you have.

  • Write a small Python wrapper in VSCode that sends prompts to your /api/generate or /api/chatbot endpoint and reads the response.

  • You can even attach this to a VSCode task or a Jupyter notebook cell to interactively test prompts.

This avoids the login/authentication issues entirely, and you have full control over the request/response format.

A minimal example might look like this:

import requests

url = "http://localhost:11434/api/generate"
payload = {"prompt": "Hello, can you explain Python functions?", "max_tokens": 100}

response = requests.post(url, json=payload)
data = response.json()

print(data.get("response"))

This directly queries your local Hugging Face model, waits for the dictionary with "response", and prints it—no Google logins, no tokens, fully offline.


3. Why some VSCode integrations may never fully support Hugging Face offline

  • Many extensions are tightly coupled to cloud APIs for features like token usage tracking, conversation history, and context management.

  • Hugging Face local models don’t provide the same cloud API endpoints, so extensions can’t natively talk to them without custom adapters.

Unless the extension explicitly supports a custom HTTP endpoint with your JSON structure, you’ll keep running into these issues.


Recommended path forward

  1. Keep your uvicorn server + local Hugging Face model as you have.

  2. Use a custom Python script or notebook in VSCode to interact with the model.

  3. Optionally, write a lightweight VSCode extension yourself to call your API if you want editor integration—this is doable without external login.


For a step-by-step guide, Hugging Face has a great tutorial for running models locally via Python:

https://huggingface.co/docs/transformers/installation

And for building custom APIs to interface with VSCode or other clients:

https://fastapi.tiangolo.com/tutorial/


TL;DR:

VSCode AI extensions often expect cloud APIs with logins. For fully offline Hugging Face models, the most reliable approach is to talk to your local API directly via Python, rather than forcing the extensions to work.

You already have everything set up; you just need a lightweight wrapper in VSCode to send prompts and handle responses.


Thank you for the detailed AI responses.

For clarity, I was not trying different plugins at once; I tried Continue, LM Studio, CodeGPT and AI Tools in VSCode separately, and deleted each before trying the next.

The Continue docs for running offline, without internet (the guide here):

provide only one link, in point 3, which points to model-providers/ollama, and that page is blank. There is no specific configuration given there.

My understanding is that Ollama and Hugging Face each provide their own ecosystem and interface code to load and set up models, pass prompts to the tokenizer, and then read the response. The imports I use in Python, and the functions I call, seem to be unique to the Hugging Face ecosystem. While the model downloaded may (or may not) be the same when run through the Ollama and Hugging Face ecosystems, the front end is different. I prefer the Hugging Face ecosystem, which I have tested and shown in my tests to make no connection to the internet after the model is loaded onto the local disk by a different program, provided the environment variables:

os.environ['TRANSFORMERS_OFFLINE'] = '1'
os.environ['HF_DATASETS_OFFLINE'] = '1'

are defined near the beginning of the code. I use the linux “strace” command to verify this:

strace -f -e connect -s 10000 -o trace.log python3 MyCode.py

Note the -f option follows all processes that are spawned by the code. Zero connections made. Ollama cannot do better than zero, so I prefer to stick with what I have tested.

@hellencharless54 my research indicates that to create a personal VSCode extension backed by Python, I must write a TypeScript (or JavaScript) wrapper to connect VSCode to the Python script. To do this I would have to use npm, generator-code, and Yeoman (yo). I have no previous experience with TypeScript or JavaScript, npm, Yeoman, or anything like that. I will see what I can get my LLM to write for me, but right now the details, requirements, and scope of a VSCode extension project are fuzzy to me.

I am thinking that changing my uvicorn server to an OpenAI-compatible API format is probably a better idea.

I got the following code as a starting point:

import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from typing import List, Optional

# 1. Load Model and Tokenizer
MODEL_ID = "gpt2" # Replace with your local model path or HF hub ID
print(f"Loading model: {MODEL_ID}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

app = FastAPI(title="Local HuggingFace OpenAI-Compatible API")

# 2. Define OpenAI-compatible Schemas
class ChatMessage(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 50

class ChatCompletionResponse(BaseModel):
    choices: list

# 3. API Endpoint
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def chat_completions(request: ChatCompletionRequest):
    # Convert chat messages to a single prompt
    prompt = "\n".join([msg.content for msg in request.messages])
    
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    # Generate response
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, 
            max_new_tokens=request.max_tokens,
            temperature=request.temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    response_text = tokenizer.decode(output_ids[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    
    # Format as OpenAI response
    return {
        "choices": [{
            "message": {"role": "assistant", "content": response_text},
            "finish_reason": "stop",
            "index": 0
        }]
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

So I got it working, sort of, starting with modifications like those above. Continue / VSCode stuffs the prompt space with so many general rules, even in chat mode, that my small model, running on a smaller-VRAM GPU card, just seized up. Yes, it was taking 2 minutes to respond to a short prompt in a curl proof of the interface, but given the entire prompt provided by VSCode, it just sat there. I inserted print statements to make sure it was getting to various parts of the code, to see what the entire request was and what the inputs to the model were, and to mark the start and end of generation, and it just hung with no result for some time. I went away, did other things, and came back to find VSCode had timed out and my uvicorn server saying it had finished. I did not print the output on the server side; I guess I should, just to make sure it did not completely fail to generate.

I guess I will continue copying and pasting from my offline Chatbot to generate code.

FYI, this is what Continue or VSCode inserted in front of my test input in chat:

Prompt :<important_rules>
  You are in chat mode.

  If the user asks to make changes to files offer that they can use the Apply Button on the code block, or switch to Agent Mode to make the suggested updates automatically.
  If needed concisely explain to the user they can switch to agent mode using the Mode Selector dropdown and provide no other details.

  Always include the language and file name in the info string when you write code blocks.
  If you are editing "src/main.py" for example, your code block should start with '```python src/main.py'

  When addressing code modification requests, present a concise code snippet that
  emphasizes only the necessary changes and uses abbreviated placeholders for
  unmodified sections. For example:

  ```language /path/to/file
  // ... existing code ...

  {{ modified code here }}

  // ... existing code ...

  {{ another modification }}

  // ... rest of code ...
  ```

  In existing files, you should always restate the function or class that the snippet belongs to:

  ```language /path/to/file
  // ... existing code ...

  function exampleFunction() {
    // ... existing code ...

    {{ modified code here }}

    // ... rest of function ...
  }

  // ... rest of code ...
  ```

  Since users have access to their complete file, they prefer reading only the
  relevant modifications. It's perfectly acceptable to omit unmodified portions
  at the beginning, middle, or end of files using these "lazy" comments. Only
  provide the complete file when explicitly requested. Include a concise explanation
  of changes unless the user specifically asks for code only.

</important_rules>

The matter is closed.


Plugins don’t seem to work very well with autocomplete features.

Also, regardless of the model size, when handling long context lengths with an LLM, if you don’t choose the right attention backend, performance can drop to absurdly slow levels… This is likely to be a problem for coding tasks.
And if you’re using an older generation of GPUs, you may have fewer options for attention backends.


The highest-leverage improvements are on the prompt path, model-role split, and memory settings.

1. Make autocomplete the first success target

In Continue, rules are included in Agent, Chat, and Edit, but not in autocomplete or apply. Continue also currently recommends Qwen2.5-Coder 1.5B and Qwen2.5-Coder 7B as strong open autocomplete models. That makes autocomplete much lighter than chat on a small local machine. (Continue Docs)

That means your first target should be:

  • fast inline completion
  • small prompt window
  • small output
  • no extra repo context at first

Only after that is stable should you optimize chat/edit.

2. Split chat and autocomplete into separate model roles

Continue supports model roles such as chat, edit, and autocomplete, with separate settings for each. For your setup, that is the cleanest architecture. Use one model for chat/edit and a smaller, faster one for autocomplete. (Continue Docs)

A practical pattern is:

  • chat/edit: your current instruct model
  • autocomplete: a smaller coder model such as Qwen2.5-Coder 1.5B

That avoids forcing one local model to satisfy two very different latency targets. (Continue Docs)

3. Shrink the base system prompt hard

Continue’s config supports baseSystemMessage, and its rules system appends rules into the system message for Chat, Agent, and Edit. So if the local model is slow, the fastest gain usually comes from replacing the default long instruction block with something minimal and removing extra rules until the loop is stable. (Continue Docs)

A good first-pass chat system prompt is just:

baseSystemMessage: "You are a local coding assistant. Be brief. Prefer minimal diffs."

That is not magic. It just cuts prompt weight.

4. Cap prompt and output size aggressively

Continue exposes the exact settings you need:

  • defaultCompletionOptions.contextLength
  • defaultCompletionOptions.maxTokens
  • requestOptions.timeout
  • autocompleteOptions.maxPromptTokens
  • autocompleteOptions.modelTimeout
  • autocompleteOptions.onlyMyCode
  • autocompleteOptions.useImports
  • autocompleteOptions.useRecentlyEdited
  • autocompleteOptions.useRecentlyOpened (Continue Docs)

For a first stable setup, I would start around:

  • chat/edit contextLength: 4096
  • chat/edit maxTokens: 128 or 256
  • autocomplete maxPromptTokens: 256 to 384
  • autocomplete maxTokens: 32 to 64
  • onlyMyCode: true
  • useImports: false
  • useRecentlyEdited: false
  • useRecentlyOpened: false

Those numbers are my recommendation, not a Continue default. The reason is simple: prompt growth is usually what turns “slow but usable” into “appears broken.”

5. Increase timeouts before judging the setup

Continue supports request-level timeout controls and autocomplete timeout controls in config. If the model is working but slow, short client timeouts can make a valid local setup look dead. (Continue Docs)

A reasonable first pass is:

requestOptions:
  timeout: 180000

and for autocomplete:

autocompleteOptions:
  modelTimeout: 12000

That gives the local model more room while still keeping inline completion from hanging forever.

6. If you stay on Ollama, use the memory-saving switches

Ollama documents two settings that matter a lot for long context on small VRAM:

  • OLLAMA_FLASH_ATTENTION=1
  • OLLAMA_KV_CACHE_TYPE=q8_0 or q4_0 (Ollama)

Ollama states that Flash Attention can significantly reduce memory usage as context grows, and that KV-cache quantization reduces memory further. It documents q8_0 as using about half the memory of f16 with small quality loss, while q4_0 uses about one quarter with more noticeable degradation. (Ollama)

So the safest first server setting is:

OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

Then only move to q4_0 if memory is still too tight.

7. Keep context modest first

Ollama’s context-length docs say that on systems with under 24 GiB VRAM, the default context is 4K, and that larger context needs more memory. They also recommend using ollama ps to check whether the model remains fully on GPU or gets partially offloaded to CPU. (Ollama)

So on a 6 GB class machine, I would not begin with 32K or 64K. I would test in this order:

  • 4096
  • 8192
  • 16384

and stop increasing once ollama ps shows CPU offload or latency becomes unusable. That step-up sequence is my recommendation, based on Ollama’s documented VRAM guidance. (Ollama)

8. Use a cleaner Continue config

A minimal local config for your setup could look like this:

name: Local HF
version: 1.0.0
schema: v1

models:
  - name: local-chat
    provider: openai
    apiBase: http://127.0.0.1:8000/v1
    model: qwen2.5-coder-3b-instruct
    roles: [chat, edit]
    baseSystemMessage: "You are a local coding assistant. Be brief. Prefer minimal diffs."
    defaultCompletionOptions:
      contextLength: 4096
      maxTokens: 256
      temperature: 0.2
    requestOptions:
      timeout: 180000

  - name: local-autocomplete
    provider: openai
    apiBase: http://127.0.0.1:8000/v1
    model: qwen2.5-coder-1.5b-instruct
    roles: [autocomplete]
    autocompleteOptions:
      debounceDelay: 400
      maxPromptTokens: 384
      modelTimeout: 12000
      onlyMyCode: true
      useImports: false
      useRecentlyEdited: false
      useRecentlyOpened: false
    defaultCompletionOptions:
      temperature: 0.1
      maxTokens: 64

Every field used there is documented by Continue’s config reference. The exact values are tuned for a small local setup. (Continue Docs)

9. Keep Agent mode off until the basics are solid

Continue’s agent flow sends tools along with chat requests and can loop through tool calls and tool results. That makes the prompt path heavier and more complex than plain chat or autocomplete. On constrained local hardware, agent mode is a later step, not the first one. (Continue Docs)

10. My priority order

If I were tuning your setup, I would do it in this order:

  1. Get autocomplete working fast with a small model. (Continue Docs)
  2. Strip chat down to a tiny base system message and no extra rules. (Continue Docs)
  3. Keep context at 4K–8K first. (Ollama)
  4. Turn on Flash Attention and q8_0 KV cache if using Ollama. (Ollama)
  5. Increase timeout before concluding the integration is failing. (Continue Docs)
  6. Only then add larger context, more rules, or agent features. (Continue Docs)

The main idea is simple: reduce the problem from “local IDE assistant” to “one small model, one small prompt, one fast feature.”