Hi everyone!
For the past few weeks I've been fine-tuning a model. It turned out too slow at inference, so I used "gguf-my-repo" to quantize it to a 4-bit GGUF.
I have previously deployed models on Inference Endpoints, but never a GGUF model.
When spinning up the server, I left most of the configuration at its defaults.
The server spins up fine and shows "Running". To test the endpoint, I send a POST request from Postman and get the following error:
{"error": {"code": 404, "message": "File Not Found", "type": "not_found_error"}}
The error doesn't seem to refer to the model file itself, since I can see the file has been correctly identified in the Inference Endpoints UI.
I've even tried changing the request URL in Postman to '/chat/completions', but the error persists…
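For reference, this is roughly the request I'm sending (endpoint URL, token, and payload values below are placeholders, not my real ones):

```python
import json
import urllib.request

# Placeholders -- my actual endpoint URL and HF token differ.
ENDPOINT_URL = "https://my-endpoint.endpoints.huggingface.cloud/chat/completions"
HF_TOKEN = "hf_xxx"

# OpenAI-style chat payload, which is what I assume the endpoint expects.
payload = {
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    ENDPOINT_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# urllib.request.urlopen(req)  # commented out: needs a live endpoint
```

This mirrors what Postman sends; the response body is the 404 JSON above.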
I'm lost; any help would be appreciated!