Cannot access the Inference Provider's service

I have a Pro subscription, and I am trying to access the Wan-AI/Wan2.2-TI2V-5B model (or tencent/HunyuanVideo) via an asynchronous POST request to https://router.huggingface.co/hf-inference using my Bearer token. I am consistently receiving a 404 Not Found error, which suggests a routing or mapping issue on the server’s end.


Usage of Inference Providers has some quirks.


The 404 is expected. You are calling the wrong endpoint and the wrong provider for those models. The router is behaving as designed; it is not a mapping bug on Hugging Face’s side.

I will break this into:

  1. What your current call is doing.
  2. How Hugging Face’s router + providers actually work.
  3. Why Wan-AI/Wan2.2-TI2V-5B and tencent/HunyuanVideo 404 on hf-inference.
  4. How to call these models correctly (sync + async).
  5. How to keep your “asynchronous POST” architecture without fighting the API.

1. What you are doing now

You described:

  • You have a Pro subscription.

  • You send an asynchronous POST (from your own code) to:

    POST https://router.huggingface.co/hf-inference
    Authorization: Bearer hf_...
    Content-Type: application/json
    
  • You get 404 Not Found consistently.

Two key problems at once:

  1. https://router.huggingface.co/hf-inference by itself is not a valid inference route; it is just a prefix.

    • For HF Inference, official examples use a pattern like:

      curl https://router.huggingface.co/hf-inference/models/facebook/bart-large-mnli \
        -H "Authorization: Bearer $HF_TOKEN" \
        -d '{"inputs": "...", "parameters": {...}}'
      
      
  2. The models you want, Wan-AI/Wan2.2-TI2V-5B and tencent/HunyuanVideo, are not hosted by the hf-inference provider. They’re hosted by third-party Inference Providers such as Fal, WaveSpeedAI, Novita, etc. (Hugging Face)

So even if you “fixed” the path to:

POST https://router.huggingface.co/hf-inference/models/Wan-AI/Wan2.2-TI2V-5B

you would still get a 404, because there is no HF-Inference deployment for that model.


2. Background: router + providers (why /hf-inference often 404s)

2.1 Three layers in the current Hugging Face stack

Modern Hugging Face inference is structured like this:

  1. Router (https://router.huggingface.co)

    • A “switchboard” that forwards requests to different backends called Inference Providers.
    • It is used by the Python InferenceClient, the JS client, and OpenAI-style /v1/... APIs. (Hugging Face)
  2. Inference Providers (Fal, WaveSpeedAI, Novita, Together, HF Inference, etc.)

    • Each provider supports particular tasks (chat, text-to-image, text-to-video…).
    • The providers table shows, for example, that Fal, Novita and WaveSpeed support Text to Video. HF Inference also supports text-to-video as a task, but not for every model. (Hugging Face)
  3. HF Inference

    • This is just one provider in that table (“HF Inference”).
    • It replaces the old api-inference.huggingface.co, but it only hosts a limited set of “warm” models. Many hub models are only available via third-party providers, not via HF Inference. (Hugging Face Forums)

2.2 What /hf-inference actually means

  • https://router.huggingface.co/hf-inference/... = “send this request to the HF Inference provider.”

  • That path is only valid in specific shapes, for example:

    /hf-inference/models/<model-id>
    /hf-inference/models/<model-id>/v1/chat/completions
    
    
  • Calling the bare root /hf-inference will always 404; it is not a complete resource path.

So conceptually:

Router = front door.
/hf-inference = “use the HF Inference backend, not Fal/Novita/etc.”
404 here usually means “this path/model does not exist on HF Inference.”
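In code form, a valid HF Inference route is always derived from a model id. A tiny helper (illustrative only; the `hf_inference_url` name is mine) makes the required shape explicit:

```python
def hf_inference_url(model_id: str) -> str:
    # A complete HF Inference route always includes /models/<model-id>;
    # the bare /hf-inference prefix only selects the backend and is not
    # a full resource path, so it 404s on its own.
    return f"https://router.huggingface.co/hf-inference/models/{model_id}"
```
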


3. Where Wan2.2 and HunyuanVideo actually live

3.1 Evidence from provider and model listings

From the Fal and WaveSpeed provider/model lists: (Hugging Face)

  • Wan-AI/Wan2.2-TI2V-5B is explicitly listed as a Text-to-Video model available via Fal and WaveSpeedAI (and other providers).
  • tencent/HunyuanVideo is also listed as Text-to-Video on Fal and appears in the text-to-video model index with multiple providers. (Hugging Face)

From the WaveSpeed provider doc: (Hugging Face)

import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="wavespeed",
    api_key=os.environ["HF_TOKEN"],
)

video = client.text_to_video(
    "A young man walking on the street",
    model="Wan-AI/Wan2.2-TI2V-5B",
)

From the Fal/HunyuanVideo ecosystem: (fal.ai)

  • Fal advertises fal-ai/hunyuan-video as a text-to-video endpoint.
  • The Hugging Face model listings show tencent/HunyuanVideo as a text-to-video model available via Fal and other providers.

The important point: all the official code examples for these models use provider="fal-ai" or provider="wavespeed" (or similar), never provider="hf-inference".

So when you target https://router.huggingface.co/hf-inference, you are explicitly asking to use the wrong provider for these models.


4. Why your specific request returns 404

Putting it together:

  1. Invalid path shape

    • POST https://router.huggingface.co/hf-inference → no /models/<id> and no task suffix.
    • That path is not defined in the HF router; 404 is the correct HTTP response.
  2. Wrong provider for these models

    • Wan-AI/Wan2.2-TI2V-5B and tencent/HunyuanVideo are deployed on Fal/WaveSpeed/Novita, not on HF Inference. (Hugging Face)

    • Even if you send:

      POST https://router.huggingface.co/hf-inference/models/Wan-AI/Wan2.2-TI2V-5B
      

      HF Inference has no such deployment → 404.

  3. This is a documented migration pattern

    • Hugging Face forum threads about the old api-inference endpoint show people hitting router.huggingface.co/hf-inference and seeing 404 until they switch to proper Inference Providers usage (InferenceClient with provider="..."). (Hugging Face Forums)

So your 404 is basically:

“The HF Inference provider does not know about this path/model. Please call a supported provider or use the documented client.”

It is not a bug or a misrouting on HF’s side; it’s a mismatch between your chosen endpoint/provider and where those models actually run.
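The path-shape half of the diagnosis can be distilled into a small check (a sketch of my own, not an HF API; it only encodes the route shapes described above, and a well-shaped path can still 404 if the model is not deployed on HF Inference):

```python
def diagnose_hf_inference_path(path: str) -> str:
    # HF Inference paths must look like /hf-inference/models/<org>/<name>[...];
    # anything shorter is not a complete resource and will 404.
    parts = [p for p in path.split("/") if p]
    if parts[:1] != ["hf-inference"]:
        return "not an HF Inference route"
    if len(parts) < 4 or parts[1] != "models":
        return "incomplete path: expected /hf-inference/models/<model-id>"
    return "path shape OK (model must still be deployed on HF Inference)"
```
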


5. How to call these models correctly

5.1 Recommended: Python InferenceClient.text_to_video(...)

The simplest and most stable approach is to stop hand-crafting HTTP calls and use the official client, which already knows how to talk to the router and providers. (Hugging Face)

Example for Wan2.2 (WaveSpeed)

import os
from huggingface_hub import InferenceClient

HF_TOKEN = os.environ["HF_TOKEN"]

client = InferenceClient(
    provider="wavespeed",  # or "fal-ai", "novita", etc. — all listed in the docs
    api_key=HF_TOKEN,
)

video_bytes = client.text_to_video(
    "A cinematic shot of a city at sunset, drone view",
    model="Wan-AI/Wan2.2-TI2V-5B",
)

with open("wan22.mp4", "wb") as f:
    f.write(video_bytes)

This is almost exactly the snippet in the WaveSpeed provider docs, just with a different prompt. (Hugging Face)

Example for HunyuanVideo (Fal)

import os
from huggingface_hub import InferenceClient

HF_TOKEN = os.environ["HF_TOKEN"]

client = InferenceClient(
    provider="fal-ai",   # HunyuanVideo is available via Fal and other providers
    api_key=HF_TOKEN,
)

video_bytes = client.text_to_video(
    "A dragon flying above mountains in stormy weather",
    model="tencent/HunyuanVideo",
)

with open("hunyuan.mp4", "wb") as f:
    f.write(video_bytes)

HunyuanVideo is listed as text-to-video on Fal and in the HF model index; this pattern follows the official Text-to-Video and provider docs. (fal.ai)

Using automatic provider selection

On current huggingface_hub versions, you can also let HF pick the provider:

client = InferenceClient(
    provider="auto",     # or just omit provider; HF chooses based on model + task
    api_key=HF_TOKEN,
)

video_bytes = client.text_to_video(
    "A close-up of waves crashing on rocks, slow motion",
    model="Wan-AI/Wan2.2-TI2V-5B",
)

Internally, this still goes through https://router.huggingface.co, but you no longer worry about /hf-inference vs Fal vs WaveSpeed; the client handles routing.


5.2 Keeping your code “asynchronous” safely

The Hugging Face Text-to-Video API itself is synchronous: one HTTP request in, video bytes back when ready. The “asynchronous” part should live in your own application layer.

A practical pattern in Python:

import asyncio
from huggingface_hub import InferenceClient
import os

HF_TOKEN = os.environ["HF_TOKEN"]

client = InferenceClient(
    provider="wavespeed",
    api_key=HF_TOKEN,
)

async def generate_video_async(prompt: str) -> bytes:
    # Run blocking call in a worker thread so your event loop is not blocked
    return await asyncio.to_thread(
        client.text_to_video,
        prompt,
        model="Wan-AI/Wan2.2-TI2V-5B",  # model is a keyword-only argument
    )

# Example usage:
# video = await generate_video_async("A robot walking in a neon-lit alley")

You can also put this call behind a job queue:

  1. Your frontend sends an async POST to your own /video_jobs endpoint.
  2. Your backend enqueues a job.
  3. A worker process calls client.text_to_video(...) (blocking is fine here).
  4. The worker stores the video and marks the job as done.
  5. Your frontend polls or subscribes to a status endpoint.

Key idea: do not try to make https://router.huggingface.co/hf-inference itself behave like an async job queue. Treat HF’s call as a long-running synchronous operation, and build your async behavior around it.
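The five-step queue pattern above can be sketched with only the standard library. `generate_video` here is a stand-in you would replace with `client.text_to_video(...)` in production; the `submit_job` / `job_status` names are mine, mirroring what your own endpoints would do:

```python
import queue
import threading
import uuid

def generate_video(prompt: str) -> bytes:
    # Stand-in for the blocking provider call; replace with
    # client.text_to_video(prompt, model="Wan-AI/Wan2.2-TI2V-5B").
    return b"video-bytes-for:" + prompt.encode()

jobs: dict[str, dict] = {}                 # job_id -> {"status", "prompt", "video"}
job_queue: "queue.Queue" = queue.Queue()

def submit_job(prompt: str) -> str:
    """What your own POST /video_jobs endpoint would do: enqueue and return an id."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "prompt": prompt, "video": None}
    job_queue.put(job_id)
    return job_id

def worker() -> None:
    """Worker process/thread; blocking is fine here, each job is one long call."""
    while True:
        job_id = job_queue.get()
        if job_id is None:                 # shutdown sentinel
            break
        jobs[job_id]["status"] = "running"
        jobs[job_id]["video"] = generate_video(jobs[job_id]["prompt"])
        jobs[job_id]["status"] = "done"
        job_queue.task_done()

def job_status(job_id: str) -> str:
    """What your polling/status endpoint would return."""
    return jobs[job_id]["status"]
```

Your frontend then polls `job_status` (or subscribes) instead of waiting on one long HTTP request.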


5.3 If you absolutely must use raw HTTP

For languages with no HF client, you should still avoid guessing /hf-inference paths.

Options:

  1. Use Hugging Face’s router via InferenceClient in a small sidecar service (Python/Node), and talk HTTP/gRPC to that sidecar from your main stack.
  2. Or call provider APIs directly (Fal, WaveSpeed, Novita) using their documented HTTP endpoints and keys. For example, Fal exposes fal-ai/hunyuan-video with its own JSON schema and endpoint. (fal.ai)

Both are more stable than reverse-engineering router URLs, which can change as HF evolves the platform.
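If you go the sidecar route, the shape is small: a tiny HTTP service that owns the InferenceClient and exposes one endpoint to the rest of your stack. The sketch below uses only the standard library; `generate_video` is a stand-in you would replace with `client.text_to_video(...)`, and the `/text-to-video` route is my own naming:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate_video(prompt: str) -> bytes:
    # Stand-in: replace with client.text_to_video(prompt, model=...) in production.
    return b"FAKE_MP4_BYTES_FOR_" + prompt.encode()

class SidecarHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/text-to-video":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        video = generate_video(payload["prompt"])
        self.send_response(200)
        self.send_header("Content-Type", "video/mp4")
        self.send_header("Content-Length", str(len(video)))
        self.end_headers()
        self.wfile.write(video)

    def log_message(self, *args):
        pass  # silence per-request logging

def make_server(port: int = 0) -> HTTPServer:
    # port=0 lets the OS pick a free port (useful for local runs and tests)
    return HTTPServer(("127.0.0.1", port), SidecarHandler)
```

Your main stack then POSTs JSON to the sidecar and never touches router URLs directly; only the sidecar needs the HF token and the huggingface_hub dependency.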


6. How to map your current code to the correct design

You can think of your migration in three concrete steps:

  1. Stop calling

    POST https://router.huggingface.co/hf-inference
    

    That route is incomplete and points to the wrong provider for these models.

  2. Switch to the documented Text-to-Video client

    • Python/Node: InferenceClient(..., provider="wavespeed" | "fal-ai" | "novita" | "auto") + text_to_video(..., model="Wan-AI/Wan2.2-TI2V-5B" or "tencent/HunyuanVideo"). (Hugging Face)
  3. Wrap that call in your own async job pattern

    • Background workers / asyncio.to_thread / queue.
    • Your client still does an “asynchronous POST”, but that POST is to your backend, not directly to /hf-inference.

7. Minimal, concrete “fix” summary

  1. The 404 is expected because:

    • /hf-inference with no /models/... is not a valid endpoint. (Hugging Face)
    • Wan-AI/Wan2.2-TI2V-5B and tencent/HunyuanVideo are served by providers like Fal and WaveSpeed, not by HF Inference. (Hugging Face)
  2. Use the official text-to-video client instead of manual POSTs:

    from huggingface_hub import InferenceClient
    import os
    
    HF_TOKEN = os.environ["HF_TOKEN"]
    
    client = InferenceClient(
        provider="wavespeed",  # or "fal-ai", "novita", "auto"
        api_key=HF_TOKEN,
    )
    
    video = client.text_to_video(
        "A young man walking on the street",
        model="Wan-AI/Wan2.2-TI2V-5B",
    )
    
  3. Keep your application asynchronous by wrapping this call in a background worker or asyncio.to_thread, rather than trying to make /hf-inference itself into an async job endpoint.