Inference Endpoint Initializing forever !!!
Without knowing the specifics of your inference endpoint setup, I can only offer general advice…
The most common issue is probably related to `torch.compile`.
Below is a structured, "from first principles" view of:
- What a dedicated Hugging Face Inference Endpoint is doing while it says `Initializing`
- The main causes of "Initializing forever"
- Concrete solutions / tests for each cause
- A short decision checklist at the end
I will repeat key ideas in different ways so the mental model is very clear.
1. Background: what "Initializing" means for a dedicated endpoint
A dedicated endpoint is basically:
- A reserved VM + GPU(s) + a load balancer + a container running your model.
When the status is `Initializing`, roughly this is what is happening:
- **Scheduling / capacity**
  - HF tries to reserve the requested GPU in the region and instance type.
  - If there is no capacity, you can see messages like `Endpoint failed to start. Scheduling failure: not enough hardware capacity` in the UI or logs. (Hugging Face Forums)
- **Container startup**
  - HF pulls the image (their standard image or your custom Docker image).
  - HF runs an entrypoint that:
    - starts the inference runtime (TGI, vLLM, or HF Toolkit), or
    - runs your own server if you provided a custom image.
- **Your code runs**
  - For Toolkit handlers, this is where `EndpointHandler.__init__` loads the model and sets up any resources.
  - For custom images, this is where your server binds to a port and starts listening.
- **Health checks**
  - HF repeatedly probes a health endpoint on a specific port.
  - Only when those checks succeed does the endpoint move to `running`.
  - If health never goes green (port wrong, app hung, crashes, etc.), it stays in `Initializing` until HF times it out or marks it `failed`. The FAQ explicitly says that if logs show your app running but the status is stuck at `initializing`, it is usually a port mapping problem. (Hugging Face)
Typical dedicated endpoint expectations:
- Docs and forum discussions imply a "normal" cold start is a few minutes (image pull + model load + light warmup). Staying in `Initializing` for tens of minutes or indefinitely is not normal. (Hugging Face Forums)
So "Initializing forever" always means:
The platform never sees your container as healthy, or it cannot schedule the container at all.
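Conceptually, the platform's probe behaves like the loop sketched below. This is only an illustration of the mechanism, not HF's actual implementation; the `/health` path, port `80`, and the timing values are assumptions.

```python
# Illustration only: what an external health probe conceptually does while the
# endpoint shows "Initializing". The /health path, port 80, and timings are assumptions.
import time
import requests

def wait_until_healthy(url="http://localhost:80/health", timeout_s=600, interval_s=5):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True      # the platform would flip the endpoint to "running"
        except requests.RequestException:
            pass                 # container not listening yet; keep polling
        time.sleep(interval_s)
    return False                 # endpoint stays "Initializing" until timeout / "failed"
```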
2. Cause 1 - Hardware capacity / scheduling issues (platform-side)
Context / background
There are several HF forum threads where users suddenly cannot start an endpoint that was previously working. The UI shows errors like:
- `Endpoint failed to start. Scheduling failure: not enough hardware capacity` (Hugging Face Forums)
- Or similar "unable to schedule" messages. (Hugging Face Forums)
HF staff replies often say things like:
- "We had a minor issue in eu-west-1. This should be fixed." (Hugging Face Forums)
- Or "This error is related to GPU availability; change instance or region if possible." (Hugging Face Forums)
This is a pure infra problem: the requested GPU node type is not available or a quota/cluster issue blocks scheduling.
How it looks when you're debugging
- The endpoint sits in `Initializing` or quickly flips to `Failed` with a scheduling error.
- A tiny test endpoint in the same region / instance type can also fail to start.
- The logs may not even show your code; the failure happens before your container fully starts.
What you can do
- **Test with a very small "known good" endpoint in the same region + instance type**
  - Example: deploy a small official model (e.g. `distilbert` or `gpt2`) as a dedicated endpoint; a scripted version of this test is sketched at the end of this section.
  - If that also fails or gets stuck, it is a strong signal that capacity / quota is the root cause, not your handler.
- **Try a different GPU type or region**
  - A thread from June 2025 shows HF staff directly telling a user that the fix is to choose another instance/region when seeing scheduling failures. (Hugging Face Forums)
- **Contact HF support with concrete details**
  - HF staff in "Dedicated endpoint stuck at Initializing" and related threads ask for:
    - Endpoint name
    - Region
    - Instance type
    - Time window of the failed attempts (Hugging Face Forums)
  - This is required for them to debug internal logs and quotas.
If this cause is confirmed, your code changes will not fix it; it must be resolved by Hugging Face or by changing the hardware/region.
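If you prefer to script the baseline test rather than click through the UI, a sketch along these lines should work with a recent `huggingface_hub`. The endpoint name, vendor, region, and instance identifiers below are placeholders; copy the real values for your account from the Inference Endpoints UI.

```python
# Sketch: programmatically spin up a tiny "known good" dedicated endpoint to test
# whether capacity/scheduling is the problem. Vendor/region/instance values are
# placeholders - use the same ones as your failing endpoint.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "capacity-probe",              # hypothetical endpoint name
    repository="gpt2",             # small official model
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",            # match the failing endpoint's region
    instance_type="nvidia-t4",     # placeholder: match the failing endpoint's GPU type
    instance_size="x1",            # placeholder
    type="protected",
)

endpoint.wait(timeout=900)         # raises if it never becomes healthy in time
print(endpoint.status)             # "running" here suggests capacity is fine
endpoint.delete()                  # clean up the probe endpoint
```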
3. Cause 2 - Load balancer / routing issues (platform-side)
Context / background
Some HF forum topics show endpoints where:
- The message is "Load balancer not ready yet" for a long time, or the endpoint stays in a "stuck" state even after the container seems OK. (Hugging Face Forums)
- Other related posts group this together with "Inference Endpoints Issues" and "Dedicated endpoint stuck at Initializing." (Hugging Face Forums)
This means the front-end routing / load balancer didn't become ready, even if the backend container might be running.
How it looks
- The status remains `Initializing` or "Load balancer not ready yet" for a long time.
- Logs for your container may look normal, or you might not see any direct error.
- Other users sometimes report similar symptoms around the same time, which hints at an incident.
What to do
- Same strategy as for capacity:
  - Confirm whether a simple endpoint in the same region behaves similarly.
  - Check for threads or status page incidents mentioning load balancer issues in that region.
- If the problem appears across multiple endpoints, send HF support:
  - Endpoint names, region, timestamps, and any UI error messages like "Load balancer not ready yet." (Hugging Face Forums)
Again, this is not fixed by editing your handler.py.
4. Cause 3 - Health-check / port mapping issues (very common for dedicated + custom images)
This is one of the most common causes of "Initializing forever" when you use:
- Custom Docker images, or
- A Toolkit handler that starts its own HTTP server.
Background from HF docs and forum
The official FAQ has an exact question:
"I can see from the logs that my endpoint is running but the status is stuck at 'initializing'."
Answer:
"This usually means that the port mapping is incorrect. Ensure your app is listening on port 80 and that the Docker container is exposing port 80 externally." (Hugging Face)
A concrete forum case:
- A user deployed a ComfyUI-based custom image.
- Inside the container, ComfyUI ran on `127.0.0.1:8188`, and logs looked fine.
- The endpoint stayed in `initializing` until timeout, with no obvious error. (Hugging Face Forums)
- The cause fits the FAQ: the app listens on localhost/8188 rather than the expected external port.
Why this breaks dedicated endpoints
The HF control plane:
- Provisions the VM
- Starts your container
- Probes a specific port from outside the container
If your app:
- Listens only on `127.0.0.1` instead of `0.0.0.0`, or
- Listens on port `8188` while HF probes port `80`,

then the health check never succeeds. HF sees "container up, but not healthy → keep initializing".
Symptoms
- Logs show your server starting ("Running on http://0.0.0.0:8188", "ComfyUI started", etc.).
- UI still shows `Initializing`.
- No clear Python exceptions, no OOM; just silence after the "server started" line.
Solutions
- **For custom Docker images**
  - Ensure your server listens on `0.0.0.0:$PORT`, where `$PORT` is the port HF expects (commonly `80` unless otherwise configured); a minimal server sketch appears at the end of this section.
  - In the Dockerfile: `EXPOSE 80`
  - In your app, for example: `uvicorn app:app --host 0.0.0.0 --port 80`
  - Test locally with `docker run -p 8080:80 your-image` and then `curl http://localhost:8080/health` to mimic HF's health check.
- **For Toolkit / custom handler endpoints**
  - Do not start your own web server (Uvicorn, ComfyUI, Gradio) inside `handler.py`. The HF Toolkit expects only:

    ```python
    class EndpointHandler:
        def __init__(self, path: str = ""):
            # model loading
            ...

        def __call__(self, data):
            # inference
            ...
    ```

  - HF's runtime wraps this in the HTTP server and connects it to the health check.
- **Add structured logging**
  - Add log lines like `PHASE_START start_server`, `PHASE_END start_server`.
  - If you see "server started on 127.0.0.1:8188" and nothing else, that points to the port/host issue.
This cause is very specific: logs look "OK", but the status never leaves `Initializing`. The FAQ line about port mapping is exactly about this situation. (Hugging Face)
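To make the custom-image case concrete, here is a minimal sketch of a server that satisfies the host/port rules above. It assumes FastAPI and uvicorn are installed in the image, and the `/health` and `/predict` route names are illustrative rather than anything HF mandates.

```python
# Minimal custom-image server sketch: bind to 0.0.0.0 on the port HF probes.
# FastAPI/uvicorn and the route names are assumptions for illustration.
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    # Health probe: respond quickly once the process is up and the model is loaded.
    return {"status": "ok"}

@app.post("/predict")
def predict(payload: dict):
    # Replace with real inference; kept trivial for the sketch.
    return {"echo": payload}

if __name__ == "__main__":
    import uvicorn
    # 0.0.0.0 (not 127.0.0.1) on port 80, matching the Dockerfile's EXPOSE 80.
    uvicorn.run(app, host="0.0.0.0", port=80)
```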
5. Cause 4 - Crash loop while starting (OOM, import errors, misconfig)
Background
Many HF forum topics have messages like:
- `Server message: Endpoint failed to start. Endpoint failed` (Hugging Face Forums)
- "Inference Endpoint Fails to Start" and similar. (Hugging Face Forums)
Typical reasons:
- The container starts, runs your code, crashes, and is restarted.
- If this happens repeatedly, you see an endpoint stuck in `Initializing` or quickly flipping between `Initializing` and `Failed`.
Common crash causes
- **Out-of-memory during model load**
  - Large LLM on a too-small GPU.
  - Multiple big models loaded in one process (encoder + reranker + LLM, etc.).
- **Missing / incompatible dependencies**
  - `ModuleNotFoundError` for `bitsandbytes`, `accelerate`, `peft`, etc.
  - Mismatched `transformers` and model files.
- **Misconfigured inference engines**
  - TGI: wrong number of shards or shard size → "Shard 0 failed to start" type errors.
  - vLLM: config issues or unsupported options can crash early.
How to identify this cause
- In logs, you see repeated stack traces or the same error message over and over.
- The endpoint never reaches a stable `running` state.
- Sometimes the UI shows "Endpoint failed to start" instead of just `Initializing`. (Hugging Face Forums)
Solutions
- **Read logs carefully**
  - Identify whether you see:
    - `CUDA out of memory`,
    - `Killed` (OOM-killer),
    - `ModuleNotFoundError`,
    - explicit "shard failed" errors.
  - If logs are totally empty, the failure might be before your code (image build, entrypoint, etc.).
- **Test the exact same image / code locally**
  - Build your Docker image locally.
  - Run a minimal test:

    ```bash
    docker run -it your-image python -c "from handler import EndpointHandler; h = EndpointHandler('.'); print('OK')"
    ```

  - If this fails locally, fix that first.
- **Reduce memory usage** (a short loading sketch follows at the end of this section)
  - Use `torch_dtype=torch.float16` or `bfloat16` and `device_map="auto"` when possible.
  - Test a smaller model size to see if it starts.
  - Temporarily remove auxiliary models (CLIP, reranker, etc.) until the core model works.
- **Fix library versions**
  - Align `transformers`, `accelerate`, `safetensors`, etc., with the versions recommended by the model card or TGI/vLLM docs.
Once you stop the crash loop, the endpoint should move from Initializing to Running relatively quickly.
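As a concrete illustration of the memory-reduction bullet above, a loading sketch might look like this; the model id is a placeholder, and `device_map="auto"` assumes `accelerate` is installed.

```python
# Sketch: load the model in half precision and let accelerate place/offload layers,
# instead of OOM-ing during __init__. "your-org/your-model" is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # or torch.bfloat16 on GPUs that support it
    device_map="auto",           # requires accelerate; places layers across devices
    low_cpu_mem_usage=True,      # avoid materializing the full model twice in RAM
)
```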
6. Cause 5 - Startup too slow: big model, downloads, and torch.compile
This is where `torch.compile` becomes important.
Background: what `torch.compile` is doing
- `torch.compile` is a just-in-time compiler that captures your PyTorch graph and generates optimized kernels.
- Official PyTorch docs say:
  - Cold-start (uncached) compilation typically takes "seconds to minutes" even for common models.
  - Larger models can take 30 minutes or more. (PyTorch Docs)
- A full example tutorial reports a total script time of several minutes when using `torch.compile` on a non-trivial model. (PyTorch Docs)
So for large LLMs/diffusion models, compile is not cheap. It is an extra heavy step on top of model loading.
Why this breaks a dedicated endpoint
HF gives your container a limited window to:
- Start
- Load the model
- Answer health checks
If in `__init__` you do:

```python
self.model = AutoModelForCausalLM.from_pretrained(path, ...)
self.model = torch.compile(self.model)                # default mode
# or: torch.compile(self.model, mode="max-autotune")  # heavier autotuning
# maybe with a warmup loop
```
then:
- HF is waiting for health checks during download + load + compile + warmup.
- If compile takes too long (which is realistic for large models), health checks may time out. HF kills and restarts the pod → you see `Initializing` forever.
- If compile crashes (unsupported op, driver issue), you get a crash loop similar to Cause 4.
This is also a known pain point in broader inference platforms: cold-start times become dominated by torch.compile and the first JIT run. (PyTorch Docs)
Safer ways to use torch.compile on a dedicated endpoint
- **Feature-flag compile (on/off via an env variable)**
  - Use something like:

    ```python
    import os
    self._enable_compile = os.getenv("ENABLE_TORCH_COMPILE", "0") == "1"
    ```

  - Deploy once with `ENABLE_TORCH_COMPILE=0`:
    - If the endpoint initializes, you've isolated compile as part of the problem.
  - Later you can re-enable it in a safer manner.
- **Move compile from `__init__` to the first real request**
  - Keep `__init__` as light as possible:

    ```python
    class EndpointHandler:
        def __init__(self, path=""):
            self.model = AutoModelForCausalLM.from_pretrained(path, ...)
            self._compiled = False
            self._enable_compile = os.getenv("ENABLE_TORCH_COMPILE", "0") == "1"

        def __call__(self, data):
            if self._enable_compile and not self._compiled:
                self.model = torch.compile(self.model, mode="reduce-overhead")
                self._compiled = True
            # run inference
    ```

  - Now:
    - Health checks only test that the server is up with a loaded model.
    - The first real request pays the `torch.compile` cost.
- **Use an appropriate mode (`reduce-overhead` for small batches)**
  - PyTorch docs and HF's own guide note that `mode="reduce-overhead"` is intended to reduce Python overhead and uses CUDA graphs; it is often recommended for small-batch inference. (PyTorch Docs)
  - This mode aims to give benefits without too much extra compile complexity.
- **Compile only the hot part of the model** (see the diffusion sketch after this list)
  - For LLMs: compile the decoding block or a smaller wrapper instead of the entire pipeline.
  - For diffusion: compile the UNet rather than the whole pipeline.
- **Minimize warmup during startup**
  - Documentation and blogs recommend warmup after compile to hide latency from users, but for endpoints you can:
    - Do minimal warmup inside the first request.
    - Or send a single "warmup request" after the endpoint reaches `running`, instead of doing a big warmup inside `__init__` (see the sketch at the end of this section). (PyTorch Docs)
- **Accept that compile may not be worth it for some endpoints**
  - Benchmarks and "pitfall" articles show that for small-batch, latency-sensitive workloads, `torch.compile` sometimes adds more overhead than benefit. (Medium)
  - If you're doing low-throughput API serving, you may get more predictable wins from:
    - Smaller models
    - Quantization (e.g. 4-bit/8-bit)
    - Better hardware selection
    - Simple caching and batching
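For the diffusion case in the "compile only the hot part" bullet, a minimal sketch looks like the following; the model id is just an example.

```python
# Sketch: compile only the UNet of a diffusion pipeline, not the whole pipeline,
# to keep compile time bounded. The model id is an example.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# The denoising UNet is the hot loop; everything else stays eager.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```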
If you rearrange startup so that heavy compile and warmup are not part of the health-check window, you greatly reduce the chance of "Initializing forever".
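And if you go with the "single warmup request after `running`" option, a small client-side sketch could look like this; the `ENDPOINT_URL` and `HF_TOKEN` environment variable names and the payload shape are placeholders.

```python
# Sketch: send one warmup request after the endpoint reports "running", so the
# first torch.compile / CUDA-graph cost is paid outside the health-check window.
# ENDPOINT_URL and HF_TOKEN are placeholder environment variable names.
import os
import requests

url = os.environ["ENDPOINT_URL"]   # e.g. https://<your-endpoint>.endpoints.huggingface.cloud
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

resp = requests.post(url, headers=headers, json={"inputs": "warmup"}, timeout=600)
print(resp.status_code)
```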
7. Putting it together: practical diagnosis flow
Here is a concise decision tree you can apply to your dedicated endpoint.
Step 1 - Quick baseline
- Create a small, vanilla dedicated endpoint in the same region + GPU type.
- If that also fails or gets stuck:
  - Suspect capacity or routing (Cause 1 or 2).
  - Try another GPU/region and/or contact HF with endpoint details. (Hugging Face Forums)
Step 2 - Examine logs
- If logs are empty or the log panel 5xx's:
  - Again, suspect infra; gather endpoint name/region/timestamps and escalate. (Hugging Face Forums)
- If logs show your server running ("listening on 127.0.0.1:8188", "server started") but the status is still `Initializing`:
  - Very likely a port/host mismatch (Cause 3). Fix the server to listen on `0.0.0.0:$PORT`, ensure the Docker `EXPOSE` matches, and don't run your own server for Toolkit handlers. (Hugging Face)
- If logs show repeated Python exceptions or OOMs:
  - You have a crash loop (Cause 4). Solve OOM/import/config issues first.
- If logs show model loading messages and then nothing for a long time:
  - Likely heavy startup work (Cause 5), often involving `torch.compile`, large downloads, or huge warmups.
Step 3 - Instrument phases (optional but very helpful)
Add explicit phase logs in `__init__`:
- `PHASE_START init`
- `PHASE_START load_model`
- `PHASE_START maybe_compile`
- `PHASE_END ...`
Then:
- If logs stop before `PHASE_START maybe_compile`, compile is probably not the problem; earlier steps are.
- If logs stop at `PHASE_START maybe_compile`, or you never see `PHASE_END maybe_compile`, that strongly implicates `torch.compile`.
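A minimal sketch of such phase logging inside a Toolkit handler follows; the logging setup and the `_load_model` placeholder are assumptions to adapt to your own code.

```python
# Sketch: paired PHASE_START / PHASE_END markers so endpoint logs show exactly
# which initialization step hangs or crashes. _load_model is a placeholder.
import logging
import os
import time

import torch

logger = logging.getLogger("handler")
logging.basicConfig(level=logging.INFO)


class phase:
    """Context manager that logs paired phase markers with durations."""

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        self.t0 = time.time()
        logger.info("PHASE_START %s", self.name)

    def __exit__(self, exc_type, exc, tb):
        logger.info("PHASE_END %s (%.1fs)", self.name, time.time() - self.t0)


class EndpointHandler:
    def __init__(self, path: str = ""):
        with phase("init"):
            with phase("load_model"):
                self.model = self._load_model(path)
            with phase("maybe_compile"):
                if os.getenv("ENABLE_TORCH_COMPILE", "0") == "1":
                    self.model = torch.compile(self.model)

    def _load_model(self, path):
        # Placeholder: your existing from_pretrained(...) loading code goes here.
        raise NotImplementedError

    def __call__(self, data):
        # Placeholder inference.
        raise NotImplementedError
```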
Step 4 - Toggle torch.compile and heavy work
- Disable compile and extra warmup (via env flags) and redeploy:
  - If the endpoint finally reaches `running`, then reorganize how and when you compile.
- If disabling compile doesn't help, focus back on capacity, port mapping, and crash loops.
8. Summary: main causes and matching solutions
To make it extremely explicit:
- **Capacity / scheduling issues (HF infra)**
  - Symptom: "Endpoint failed to start. Scheduling failure: not enough hardware capacity." (Hugging Face Forums)
  - Fix: Try another region / instance; contact HF with endpoint details.
- **Load balancer / routing issues (HF infra)**
  - Symptom: "Load balancer not ready yet" for a long time, multiple users affected. (Hugging Face Forums)
  - Fix: Treat as an incident; confirm with small endpoints; escalate to HF.
- **Health-check / port mapping problems (custom images / servers)**
  - Symptom: Logs show your app running, but the status never leaves `Initializing`. (Hugging Face)
  - Fix: Ensure the server listens on `0.0.0.0` on the expected port and exposes that port; do not start your own server inside Toolkit handlers.
- **Crash loop during startup (OOM, dependency, config errors)**
  - Symptom: Repeated stack traces or errors; sometimes an explicit "Endpoint failed to start." (Hugging Face Forums)
  - Fix: Read logs; adjust model size/dtype; fix imports and TGI/vLLM config; test the image locally.
- **Startup too slow due to heavy work and `torch.compile`**
  - Symptom: Logs show long-running initialization, possibly with no error, and the endpoint never becomes healthy; disabling compile or reducing work suddenly lets it start.
  - Background: `torch.compile` cold-start time is "seconds to minutes" for common models, and much longer for large models. (PyTorch Docs)
  - Fix:
    - Guard compile with env flags.
    - Move compile to the first real request instead of `__init__`.
    - Use `mode="reduce-overhead"` and compile only hot paths.
    - Limit warmup in the health-check window.
Hi @omarzouk55! This can happen if there's low availability of the hardware selected and it's not available to use just yet. If you'd like to get started right away, you can try another instance. As soon as an instance is available, your Endpoint will be up and running!