It should work fine if you use the extension… apparently:
This can be made to work. The cleanest route for your exact requirements is:
Use Continue with a local config.yaml, and point it at either:
- a real OpenAI-compatible /v1 endpoint, or
- a fuller Ollama-compatible shim than just /api/generate. (Continue Docs)
The short reason is simple:
VS Code extensions do not care that the weights came from Hugging Face. They care about the HTTP protocol they are talking to. Continue explicitly supports both an Ollama provider with custom apiBase and an OpenAI-compatible provider with custom apiBase. AI Toolkit also supports custom Ollama endpoints and custom OpenAI-compatible endpoints. (Continue Docs)
What is probably happening in your case
Your server works for the one flow you tested with curl. The extension is likely trying more than that.
That is not speculation in the abstract. Continue's Ollama implementation calls multiple endpoints, including GET /api/tags, POST /api/show, POST /api/chat, and POST /api/generate. There are also real Continue issues from users who could reach their server manually but then saw Continue request /api/show or /api/chat and fail. (GitHub)
So this part matters:
/api/generate alone is usually not enough.
If you only spoofed /api/generate, and later added /api/tags, that still leaves a gap for tools that probe /api/show and /api/chat. That fits your symptoms very well. (Ollama Documentation)
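One low-effort way to confirm which endpoints your extension is actually probing is to stand up a catch-all server that just logs every request path. A minimal sketch, assuming FastAPI and uvicorn are installed; the port is only the Ollama default, and nothing below is specific to Continue:

```python
# Tiny catch-all server: point the extension at this instead of your shim and
# watch which paths it probes. Assumptions: FastAPI + uvicorn installed;
# 11434 is just the default Ollama port.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import uvicorn

app = FastAPI()

@app.api_route("/{path:path}", methods=["GET", "POST", "HEAD", "OPTIONS"])
async def catch_all(request: Request, path: str):
    print(f"{request.method} /{path}")  # log every probe the extension makes
    return JSONResponse({"error": "not implemented"}, status_code=404)

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=11434)
```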
The background that makes this confusing
Inside VS Code, "AI coding" is not one feature. It is usually several:
- chat
- edit/apply
- inline completion
- indexing/embeddings/model discovery
Different tools use different endpoints for those. Continue's OpenAI-compatible provider docs even mention forcing legacy completions usage, which is a clue that not every feature goes through the same route. Continue also documents separate model roles and separate autocomplete setup. (Continue Docs)
That is why "my server answers a prompt and returns a response" is necessary but not sufficient.
The strongest recommendation for your setup
Use Continue first
Continue is the best match to what you asked for because it has:
- a documented offline / air-gapped guide
- documented local config
- explicit support for Ollama
- explicit support for OpenAI-compatible providers via apiBase (Continue Docs)
Those are the clearest official explanations I found for "use a local model in an IDE without cloud dependency." (Continue Docs)
Do not start with VS Code built-in chat
VS Code's own docs say that when you use bring-your-own models for chat, the Copilot service API is still used for some tasks such as embeddings, repository indexing, query refinement, intent detection, and side queries. There are also issue reports explicitly asking for local models to work without GitHub login and completely offline, which means your complaint is shared by other users and is not solved by default. (Visual Studio Code)
So for your requirement of no login, no tracking, no tokens, no telemetry, VS Code's built-in path is the wrong first target. (Visual Studio Code)
Why "Hugging Face, not Ollama" is the wrong dividing line
This is the key conceptual point.
"Hugging Face" is where your model and tooling come from. "Ollama" or "OpenAI-compatible" is the wire protocol your editor is speaking.
A Hugging Face model can sit behind:
- your own FastAPI wrapper
- TGI
- vLLM
- another OpenAI-compatible server
- an Ollama-like shim
The editor only sees the API. It does not know or care whether the weights originally came from Hugging Face. Continue's OpenAI docs explicitly describe connecting to OpenAI-compatible providers via apiBase. AI Toolkit explicitly supports adding custom models with an OpenAI-compatible endpoint, and also custom Ollama endpoints. (Continue Docs)
So no, VS Code and Continue are not "intentionally incompatible with Hugging Face." The real compatibility boundary is protocol shape, not model origin. (Continue Docs)
The two viable designs
Design A. Keep your current Ollama-style shim
This is the quickest path if you want to reuse your work.
But then implement a more complete Ollama subset:
- GET /api/tags
- POST /api/show
- POST /api/chat
- POST /api/generate
Those are all part of Ollama's documented API surface, and they are the same paths Continue users have reported seeing in practice. (Ollama Documentation)
The official Ollama API docs list generate, chat, embeddings, list models, and show model details. That matches the shape tools tend to expect. (Ollama Documentation)
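If you go this route, a skeleton of that subset could look roughly like the sketch below. It is hedged in several ways: generate_text() is a hypothetical placeholder for however you invoke your Hugging Face model, the response fields follow my reading of the Ollama API docs and should be verified against them, and it only returns non-streaming responses even though real Ollama streams newline-delimited JSON by default, so clients that send "stream": true may need that handled.

```python
# Sketch of a fuller Ollama-style shim (non-streaming only).
# Assumptions: FastAPI and uvicorn are installed; generate_text() is a
# hypothetical stand-in for your Hugging Face inference call; field names
# follow the Ollama API docs but should be checked against what your
# extension actually reads.
from datetime import datetime, timezone
from fastapi import FastAPI
import uvicorn

app = FastAPI()
MODEL_NAME = "qwen2.5-coder:3b"  # whatever name you want the editor to see

def generate_text(prompt: str) -> str:
    # Placeholder: call your transformers / llama.cpp / other local pipeline here.
    return "stub response for: " + prompt

def now() -> str:
    return datetime.now(timezone.utc).isoformat()

@app.get("/api/tags")
def tags():
    # Model discovery: often the first thing an extension calls.
    return {"models": [{"name": MODEL_NAME, "model": MODEL_NAME,
                        "modified_at": now(), "size": 0, "digest": "",
                        "details": {}}]}

@app.post("/api/show")
def show(body: dict):
    # Model metadata; some clients read template and parameter info from here.
    return {"modelfile": "", "parameters": "", "template": "", "details": {}}

@app.post("/api/chat")
def chat(body: dict):
    messages = body.get("messages") or [{}]
    prompt = messages[-1].get("content", "")
    return {"model": MODEL_NAME, "created_at": now(), "done": True,
            "message": {"role": "assistant", "content": generate_text(prompt)}}

@app.post("/api/generate")
def generate(body: dict):
    return {"model": MODEL_NAME, "created_at": now(), "done": True,
            "response": generate_text(body.get("prompt", ""))}

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=11434)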
Design B. Switch to an OpenAI-compatible /v1 endpoint
This is the cleaner long-term design.
Continue documents using provider: openai with a custom apiBase. AI Toolkit also documents adding a self-hosted or local model with an OpenAI-compatible endpoint. (Continue Docs)
For editor tooling, this is often easier to reuse across tools than a custom fake-Ollama server.
My view: Design B is better long-term. Design A is faster if you are already close.
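For Design B, the usual shortcut is to put an existing OpenAI-compatible server (vLLM, TGI's OpenAI-compatible mode, or llama.cpp's server) in front of the weights rather than writing your own. If you do roll your own /v1 wrapper, a minimal non-streaming sketch of the two routes clients typically hit first might look like this; generate_text() is again a hypothetical placeholder for your Hugging Face call, and the response shapes mirror the OpenAI chat completions format:

```python
# Sketch of a minimal OpenAI-compatible /v1 server (non-streaming only).
# Assumptions: FastAPI and uvicorn are installed; generate_text() is a
# hypothetical stand-in for your Hugging Face inference call.
import time
from fastapi import FastAPI
import uvicorn

app = FastAPI()
MODEL_NAME = "qwen2.5-coder-3b"

def generate_text(prompt: str) -> str:
    # Placeholder: call your actual model here.
    return "stub response for: " + prompt

@app.get("/v1/models")
def list_models():
    # Many clients list models before sending any prompt.
    return {"object": "list",
            "data": [{"id": MODEL_NAME, "object": "model",
                      "created": int(time.time()), "owned_by": "local"}]}

@app.post("/v1/chat/completions")
def chat_completions(body: dict):
    messages = body.get("messages") or [{}]
    prompt = messages[-1].get("content", "")
    return {"id": "chatcmpl-local", "object": "chat.completion",
            "created": int(time.time()), "model": MODEL_NAME,
            "choices": [{"index": 0, "finish_reason": "stop",
                         "message": {"role": "assistant",
                                     "content": generate_text(prompt)}}],
            "usage": {"prompt_tokens": 0, "completion_tokens": 0,
                      "total_tokens": 0}}

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)
```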
The trap with OpenAI-compatible mode
Do not assume POST /v1/chat/completions is enough.
Continue's docs mention legacy completions handling, and real user reports show cases where chat worked differently from edit/autocomplete because different endpoints were used. That means a backend that only supports chat-style calls may still fail in coding workflows. (Continue Docs)
So if you go OpenAI-compatible, expect to support at least the endpoints your chosen extension actually uses, not just the one you wish it used. (Continue Docs)
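A quick way to see where you stand is to probe both the chat route and the legacy completions route yourself before pointing an editor at the server. A stdlib-only sketch, assuming the /v1 server from the Design B sketch on 127.0.0.1:8000; the base URL, model name, and prompts are placeholders to adjust:

```python
# Probe which /v1 routes the server actually answers; a 404 on one of them is
# often the difference between "works in curl" and "fails in the extension".
# Assumptions: server on 127.0.0.1:8000; adjust BASE and the model name.
import json
import urllib.error
import urllib.request

BASE = "http://127.0.0.1:8000"

def probe(path: str, payload: dict) -> None:
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            print(path, resp.status)
    except urllib.error.HTTPError as err:
        print(path, err.code)  # 404 here means the route is missing

probe("/v1/chat/completions",
      {"model": "qwen2.5-coder-3b",
       "messages": [{"role": "user", "content": "say hi"}]})
probe("/v1/completions",
      {"model": "qwen2.5-coder-3b", "prompt": "def add(a, b):"})
```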
The clearest explanation of how to do it
The clearest official docs I found, in order, are:
- Continue: How to Run Continue Without Internet. Best overall explanation for your privacy goal. It covers offline setup, local providers, and disabling telemetry. (Continue Docs)
- Continue: How to Understand Hub vs Local Configuration. Best explanation of why local config.yaml is the right path for an offline or restricted setup. (Continue Docs)
- Continue: How to Configure OpenAI Models with Continue. Best explanation if you want to expose your Hugging Face model through a custom /v1 server. (Continue Docs)
- Continue: How to Configure Ollama with Continue. Best explanation if you want to keep your current "spoof Ollama" idea. (Continue Docs)
- Ollama API introduction. Best reference for which /api/... endpoints an Ollama-style server normally exposes. (Ollama Documentation)
- AI Toolkit model docs. Useful mainly to confirm that custom Ollama endpoints and OpenAI-compatible endpoints are officially supported concepts. (Visual Studio Code)
What I would do if I were solving your exact problem
I would do this in order.
Step 1. Stop testing multiple VS Code AI extensions at once
Pick Continue first. It has the clearest docs for offline local use, and you can fully control the config locally. (Continue Docs)
Step 2. Decide whether you want the fastest win or the cleanest architecture
If you want the fastest win, keep your current server and make it answer the full subset from Design A: GET /api/tags, POST /api/show, POST /api/chat, and POST /api/generate.
If you want the cleanest architecture, expose an OpenAI-compatible /v1 API and point Continue's provider: openai at it. (Continue Docs)
Step 3. Use local Continue config
Continue documents local config as machine-local, offline-capable, and suitable for strict data policies. That matches your stated goal exactly. (Continue Docs)
A minimal shape looks like this:
```yaml
name: Local Config
version: 1.0.0
schema: v1
models:
  - name: Local HF via OpenAI API
    provider: openai
    model: qwen2.5-coder-3b
    apiBase: http://127.0.0.1:8000/v1
```
That pattern follows Continue's documented OpenAI-compatible configuration. (Continue Docs)
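If you are unsure where this file lives, Continue's machine-local configuration is typically the config.yaml inside the .continue directory in your home folder; the Hub vs Local Configuration guide listed above covers the exact location.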
Or, if you keep the Ollama-style shim:
```yaml
name: Local Config
version: 1.0.0
schema: v1
models:
  - name: Local HF via Ollama Shim
    provider: ollama
    model: qwen2.5-coder:3b
    apiBase: http://127.0.0.1:11434
```
That pattern follows Continue's documented Ollama configuration. (Continue Docs)
Step 4. Disable everything nonessential for the first test
Do not try to solve chat, edit, autocomplete, indexing, and agents all at once.
First get one prompt-response loop working inside Continue chat. Then add edit. Then test inline completion. Continue's docs and config model support this incremental approach. (Continue Docs)
About OpenClaw
OpenClaw is not intentionally incompatible with local Hugging Face models. Its current docs explicitly describe two local paths:
- native Ollama integration using /api/chat
- OpenAI-compatible local servers such as vLLM (OpenClaw)
So the answer is not "OpenClaw rejects Hugging Face."
But I would still not use OpenClaw as your next step. Why:
- its docs are aimed at a broader agent stack, not the simplest VS Code coding-assistant setup
- there are recent issues around custom local providers, baseUrl, and provider routing (GitHub)
So OpenClaw may become viable later, but it is a worse first target than Continue for your current goal. (OpenClaw)
My bottom-line judgment
For your case:
- Your privacy requirement is reasonable.
- Your Hugging Face local model choice is not the blocker.
- Your current fake-Ollama endpoint is probably too incomplete for the extension you are testing.
- Continue is the best first extension to target.
- VS Code built-in Copilot/BYOK is not a good fit for strict no-login offline use.
- OpenClaw is not intentionally incompatible, but it is the wrong next battle. (Continue Docs)
The cleanest practical path is:
Continue + local config + either:
- a more complete Ollama-style shim, or
- a proper OpenAI-compatible /v1 server in front of your Hugging Face model. (Continue Docs)