Inference Endpoint Initializing forever!


Without knowing the specifics of your inference endpoint setup, I can only offer general advice…
The most common issue is probably related to torch.compile.


Below is a structured, “from first principles” view of:

  • What a dedicated Hugging Face Inference Endpoint is doing while it says Initializing
  • The main causes of “Initializing forever”
  • Concrete solutions / tests for each cause
  • A short decision checklist at the end

I will repeat key ideas in different ways so the mental model is very clear.


1. Background: what “Initializing” means for a dedicated endpoint

A dedicated endpoint is basically:

  • A reserved VM + GPU(s) + a load balancer + a container running your model.

When the status is Initializing, roughly this is happening:

  1. Scheduling / capacity

    • HF tries to reserve the requested GPU in the region and instance type.
    • If there is no capacity, you can see messages like
      “Endpoint failed to start. Scheduling failure: not enough hardware capacity” in the UI or logs. (Hugging Face Forums)
  2. Container startup

    • HF pulls the image (their standard image or your custom Docker image).

    • HF runs an entrypoint that:

      • Starts the inference runtime (TGI, vLLM, or HF Toolkit), or
      • Runs your own server if you provided a custom image.
  3. Your code runs

    • For Toolkit handlers, this is where EndpointHandler.__init__ loads the model and sets up any resources.
    • For custom images, this is where your server binds to a port and starts listening.
  4. Health checks

    • HF repeatedly probes a health endpoint on a specific port.
    • Only when those checks succeed does the endpoint move to running.
    • If health never goes green (port wrong, app hung, crashes, etc.), it stays in Initializing until HF times it out or marks it failed. The FAQ explicitly says that if logs show your app running but status is stuck at initializing, it is usually a port mapping problem. (Hugging Face)

Typical dedicated endpoint expectations:

  • Docs and forum discussions imply a “normal” cold start is a few minutes (image pull + model load + light warmup). Staying in Initializing for tens of minutes or indefinitely is not normal. (Hugging Face Forums)

So “Initializing forever” always means:

The platform never sees your container as healthy, or it cannot schedule the container at all.


2. Cause 1 – Hardware capacity / scheduling issues (platform-side)

Context / background

HF themselves have several threads where users suddenly cannot start an endpoint that was working before. The UI shows errors like “Endpoint failed to start. Scheduling failure: not enough hardware capacity.”

HF staff replies often say things like:

  • “We had a minor issue in eu-west-1. This should be fixed.” (Hugging Face Forums)
  • Or “This error is related to GPU availability; change instance or region if possible.” (Hugging Face Forums)

This is a pure infra problem: the requested GPU node type is not available or a quota/cluster issue blocks scheduling.

How it looks when you’re debugging

  • The endpoint sits in Initializing or quickly flips to Failed with a scheduling error.
  • A tiny test endpoint in the same region / instance type can also fail to start.
  • The logs may not even show your code; the failure happens before your container fully starts.

What you can do

  1. Test with a very small “known good” endpoint in the same region + instance type

    • Example: deploy a small official model (e.g. distilbert or gpt2) as a dedicated endpoint.
    • If that also fails or gets stuck, it is a strong signal that capacity / quota is the root cause, not your handler.
  2. Try a different GPU type or region

    • A thread from June 2025 shows HF staff directly telling a user that the fix is to choose another instance/region when seeing scheduling failures. (Hugging Face Forums)
  3. Contact HF support with concrete details

    • HF staff in “Dedicated endpoint stuck at Initializing” and related threads ask for the endpoint name, region and instance type, and timestamps of the failed starts (a small status-polling sketch follows at the end of this section).

    • This is required for them to debug internal logs and quotas.

If this cause is confirmed, your code changes will not fix it; it must be resolved by Hugging Face or by changing the hardware/region.
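
To script that status polling, here is a minimal sketch. It assumes a recent huggingface_hub version that exposes get_inference_endpoint and the InferenceEndpoint.fetch() helper; the endpoint name and namespace are placeholders:

# Hedged sketch: record status + timestamps to include in a support ticket.
# "my-endpoint" and "my-org" are placeholders.
import time
from datetime import datetime, timezone

from huggingface_hub import get_inference_endpoint

endpoint = get_inference_endpoint("my-endpoint", namespace="my-org")

for _ in range(30):                 # poll for roughly 15 minutes
    endpoint.fetch()                # refresh status from the API
    now = datetime.now(timezone.utc).isoformat()
    print(f"{now}  status={endpoint.status}")
    if endpoint.status == "running":
        break
    time.sleep(30)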


3. Cause 2 – Load balancer / routing issues (platform-side)

Context / background

Some HF forum topics show endpoints where:

  • The message is:

    • "Load balancer not ready yet" for a long time, or
    • The endpoint stays in a “stuck” state even after the container seems OK. (Hugging Face Forums)
  • Other related posts group this together with “Inference Endpoints Issues” and “Dedicated endpoint stuck at Initializing.” (Hugging Face Forums)

This means the front-end routing / load balancer didn’t become ready, even if the backend container might be running.

How it looks

  • The status remains Initializing or “Load balancer not ready yet” for a long time.
  • Logs for your container may look normal, or you might not see any direct error.
  • Other users sometimes report similar symptoms around the same time, which hints at an incident.

What to do

  • Same strategy as capacity:

    • Confirm a simple endpoint in the same region behaves similarly.
    • Check for threads or status page incidents mentioning load balancer issues in that region.
  • If the problem appears across multiple endpoints, send HF support:

    • Endpoint names, region, timestamps, and any UI error messages like “Load balancer not ready yet.” (Hugging Face Forums)

Again, this is not fixed by editing your handler.py.


4. Cause 3 – Health-check / port mapping issues (very common for dedicated + custom images)

This is one of the most common causes of “Initializing forever” when you use:

  • Custom Docker images, or
  • A Toolkit handler that starts its own HTTP server.

Background from HF docs and forum

The official FAQ has an exact question:

“I can see from the logs that my endpoint is running but the status is stuck at ‘initializing’.”

Answer:

“This usually means that the port mapping is incorrect. Ensure your app is listening on port 80 and that the Docker container is exposing port 80 externally.” (Hugging Face)

A concrete forum case:

  • A user deployed a ComfyUI-based custom image.
  • Inside the container, ComfyUI ran on 127.0.0.1:8188, and logs looked fine.
  • The endpoint stayed in initializing until timeout, with no obvious error. (Hugging Face Forums)
    • The cause fits the FAQ: the app listens on 127.0.0.1:8188 rather than the expected external port.

Why this breaks dedicated endpoints

The HF control plane:

  • Provisions the VM
  • Starts your container
  • Probes a specific port from outside the container

If your app:

  • Listens only on 127.0.0.1 instead of 0.0.0.0, or
  • Listens on port 8188 while HF probes port 80,

then the health check never succeeds. HF sees “container up, but not healthy → keep initializing”.

Symptoms

  • Logs show your server starting (“Running on http://0.0.0.0:8188”, “ComfyUI started”, etc.).
  • UI still shows Initializing.
  • No clear Python exceptions, no OOM; just silence after the “server started” line.

Solutions

  1. For custom Docker images

    • Ensure your server listens on 0.0.0.0:$PORT, where $PORT is the port HF expects (commonly 80 unless otherwise configured).

    • In Dockerfile:

      • EXPOSE 80
    • In your app:

      • Example: uvicorn app:app --host 0.0.0.0 --port 80
    • Test locally, to mimic HF’s health check (a minimal server sketch follows at the end of this section):

      docker run -p 8080:80 your-image
      curl http://localhost:8080/health

  2. For Toolkit / custom handler endpoints

    • Do not start your own web server (Uvicorn, ComfyUI, Gradio) inside handler.py. HF Toolkit expects only:

      class EndpointHandler:
          def __init__(self, path: str = ""):
              # model loading
              ...
          def __call__(self, data):
              # inference
              ...
      
    • HF’s runtime wraps this in the HTTP server and connects it to the health check.

  3. Add structured logging

    • Add log lines like:

      • PHASE_START start_server, PHASE_END start_server.
    • If you see “server started on 127.0.0.1:8188” and nothing else, that points to the port/host issue.

This cause is very specific: logs look “OK”, but the status never leaves Initializing. The FAQ line about port mapping is exactly about this situation. (Hugging Face)
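
To make the port/host advice concrete, here is a minimal sketch of a custom-image server. FastAPI and uvicorn are assumed to be installed in the image, the /health route simply mirrors the curl test above (the exact path HF probes may differ), and port 80 matches the FAQ advice:

# Minimal sketch of a custom-image server: bind to 0.0.0.0 (not 127.0.0.1)
# and to the port HF probes (80 here, matching EXPOSE 80 in the Dockerfile).
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    # Health checks only need a cheap 200 response.
    return {"status": "ok"}

@app.post("/predict")
def predict(payload: dict):
    # Placeholder for real inference logic.
    return {"outputs": payload}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=80)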


5. Cause 4 – Crash loop while starting (OOM, import errors, misconfig)

Background

Many HF forum topics describe endpoints that show “Endpoint failed to start” in the UI or keep restarting during startup.

Typical reasons:

  • The container starts, runs your code, crashes, and is restarted.
  • If this happens repeatedly, you see an endpoint stuck in Initializing or quickly flipping between Initializing and Failed.

Common crash causes

  1. Out-of-memory during model load

    • Large LLM on too-small GPU.
    • Multiple big models loaded in one process (encoder + reranker + LLM etc.).
  2. Missing / incompatible dependencies

    • ModuleNotFoundError for bitsandbytes, accelerate, peft, etc.
    • Mismatched transformers and model files.
  3. Misconfigured inference engines

    • TGI: wrong number of shards or shard size → “Shard 0 failed to start” type errors.
    • vLLM: config issues or unsupported options can crash early.

How to identify this cause

  • In logs, you see repeated stack traces or the same error message over and over.
  • The endpoint never reaches a stable running state.
  • Sometimes the UI shows “Endpoint failed to start” instead of just Initializing. (Hugging Face Forums)

Solutions

  1. Read logs carefully

    • Identify whether you see:

      • CUDA out of memory,
      • Killed (OOM-killer),
      • ModuleNotFoundError,
      • explicit “shard failed” errors.
    • If logs are totally empty, the failure might be before your code (image build, entrypoint, etc.).

  2. Test the exact same image / code locally

    • Build your Docker image locally.

    • Run a minimal test:

      docker run -it your-image python -c "from handler import EndpointHandler; h = EndpointHandler('.'); print('OK')"
      
    • If this fails locally, fix that first.

  3. Reduce memory usage

    • Use torch_dtype=torch.float16 (or torch.bfloat16) and device_map="auto" when possible (a loading sketch follows at the end of this section).
    • Test a smaller model size to see if it starts.
    • Temporarily remove auxiliary models (CLIP, reranker, etc.) until the core model works.
  4. Fix library versions

    • Align transformers, accelerate, safetensors, etc., with the versions recommended by the model card or TGI/vLLM docs.

Once you stop the crash loop, the endpoint should move from Initializing to Running relatively quickly.
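
As a concrete illustration of the memory-reduction tips in item 3, here is a minimal loading sketch; the model id is a placeholder, and exact dtype support depends on your model and transformers version:

# Hedged sketch: load in half precision and let accelerate place the weights.
# "your-org/your-model" is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # or torch.bfloat16 on GPUs that support it
    device_map="auto",           # requires accelerate; places weights automatically
    low_cpu_mem_usage=True,      # avoids a full extra copy in CPU RAM while loading
)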


6. Cause 5 – Startup too slow: big model, downloads, and torch.compile

This is where torch.compile becomes important.

Background: what torch.compile is doing

  • torch.compile is a just-in-time compiler that:

    • Captures your PyTorch graph and generates optimized kernels.
  • Official PyTorch docs say:

    • Cold-start (uncached) compilation typically takes “seconds to minutes” even for common models.
    • Larger models can take 30 minutes or more. (PyTorch Docs)
  • A full example tutorial reports total script time of several minutes when using torch.compile on a non-trivial model. (PyTorch Docs)

So for large LLMs/diffusion models, compile is not cheap. It is an extra heavy step on top of model loading.

Why this breaks a dedicated endpoint

HF gives your container a limited window to:

  • Start
  • Load the model
  • Answer health checks

If in __init__ you do:

self.model = AutoModelForCausalLM.from_pretrained(path, ...)
self.model = torch.compile(self.model, mode="max-autotune")  # or mode="default"
# maybe with a warmup loop

then:

  • HF is waiting for health checks during download + load + compile + warmup.
  • If compile takes too long (which is realistic for large models), health checks may time out. HF kills and restarts the pod → you see Initializing forever.
  • If compile crashes (unsupported op, driver issue), you get a crash loop similar to Cause 4.

This is also a known pain point in broader inference platforms: cold-start times become dominated by torch.compile and the first JIT run. (PyTorch Docs)

Safer ways to use torch.compile on a dedicated endpoint

  1. Feature-flag compile (on/off via env variable)

    • Use something like:

      # at module level
      import os

      # inside EndpointHandler.__init__
      self._enable_compile = os.getenv("ENABLE_TORCH_COMPILE", "0") == "1"
      
    • Deploy once with ENABLE_TORCH_COMPILE=0:

      • If the endpoint initializes, you’ve isolated compile as part of the problem.
    • Later you can re-enable it in a safer manner.

  2. Move compile from __init__ to first real request

    • Keep __init__ as light as possible:

      import os

      import torch
      from transformers import AutoModelForCausalLM

      class EndpointHandler:
          def __init__(self, path=""):
              self.model = AutoModelForCausalLM.from_pretrained(path, ...)
              self._compiled = False
              self._enable_compile = os.getenv("ENABLE_TORCH_COMPILE", "0") == "1"

          def __call__(self, data):
              # Compile lazily: health checks only wait for the model load above.
              if self._enable_compile and not self._compiled:
                  self.model = torch.compile(self.model, mode="reduce-overhead")
                  self._compiled = True
              # run inference
      
    • Now:

      • Health checks only test that the server is up with a loaded model.
      • The first real request pays the torch.compile cost.
  3. Use appropriate mode (reduce-overhead for small batches)

    • PyTorch docs and HF’s own guide note that mode="reduce-overhead" is intended to reduce Python overhead and uses CUDA graphs; it is often recommended for small-batch inference. (PyTorch Docs)
    • This mode aims to give benefits without too much extra compile complexity.
  4. Compile only the hot part of the model

    • For LLMs: compile the decoding block or a smaller wrapper instead of the entire pipeline.
    • For diffusion: compile the UNet rather than the whole pipeline (see the sketch after this list).
  5. Minimize warmup during startup

    • Documentation and blogs recommend warmup after compile to hide latency from users, but for endpoints you can:

      • Do minimal warmup inside the first request.
      • Or send a single “warmup request” after the endpoint reaches running, instead of doing a big warmup inside __init__ (a sketch follows at the end of this section). (PyTorch Docs)
  6. Accept that compile may not be worth it for some endpoints

    • Benchmarks and “pitfall” articles show that for small-batch, latency-sensitive workloads, torch.compile sometimes adds more overhead than benefit. (Medium)

    • If you’re doing low-throughput API serving, you may get more predictable wins from:

      • Smaller models
      • Quantization (e.g. 4-bit/8-bit)
      • Better hardware selection
      • Simple caching and batching
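
A hedged sketch of “compile only the hot part” from item 4, using a diffusion pipeline as the example; the model id is a placeholder and diffusers is assumed to be installed:

# Hedged sketch: compile only the UNet, not the whole pipeline.
# "your-org/your-diffusion-model" is a placeholder.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "your-org/your-diffusion-model", torch_dtype=torch.float16
).to("cuda")

# Only the denoising network is compiled; the VAE, text encoder, and scheduler stay eager.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)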

If you rearrange startup so that heavy compile and warmup are not part of the health-check window, you greatly reduce the chance of “Initializing forever”.
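
And a hedged sketch of the “single warmup request after running” idea from item 5; the endpoint URL, token, and payload shape are placeholders you would adapt to your own endpoint:

# Hedged sketch: send one warmup request once the endpoint reports "running",
# so the first real user does not pay the lazy compile/warmup cost.
import os
import requests

ENDPOINT_URL = os.environ["ENDPOINT_URL"]   # placeholder: your endpoint URL
HF_TOKEN = os.environ["HF_TOKEN"]           # placeholder: a token with access to it

resp = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    json={"inputs": "warmup"},   # payload shape depends on your task/handler
    timeout=600,                 # the first request may be slow if compile happens lazily
)
print(resp.status_code, resp.text[:200])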


7. Putting it together: practical diagnosis flow

Here is a concise decision tree you can apply to your dedicated endpoint.

Step 1 – Quick baseline

  • Create a small, vanilla dedicated endpoint in the same region + GPU type.

  • If that also fails or gets stuck:

    • Suspect capacity or routing (Cause 1 or 2).
    • Try another GPU/region and/or contact HF with endpoint details. (Hugging Face Forums)

Step 2 – Examine logs

  • If logs are empty or the log panel itself returns 5xx errors:

    • Again, suspect infra; gather endpoint name/region/timestamps and escalate. (Hugging Face Forums)
  • If logs show your server running (“listening on 127.0.0.1:8188”, “server started”) but the status is still Initializing:

    • Very likely a port/host mismatch (Cause 3). Fix to 0.0.0.0:$PORT, ensure Docker EXPOSE matches, and don’t run your own server for Toolkit handlers. (Hugging Face)
  • If logs show repeated Python exceptions or OOMs:

    • You have a crash loop (Cause 4). Solve OOM/import/config issues first.
  • If logs show model loading messages and then nothing for a long time:

    • Likely heavy startup work (Cause 5), often involving torch.compile, large downloads, or huge warmups.

Step 3 – Instrument phases (optional but very helpful)

Add explicit phase logs in __init__:

  • PHASE_START init
  • PHASE_START load_model
  • PHASE_START maybe_compile
  • PHASE_END ...

Then:

  • If logs stop before PHASE_START maybe_compile, compile is probably not the problem; earlier steps are.
  • If logs stop at PHASE_START maybe_compile, or you never see PHASE_END maybe_compile, that strongly implicates torch.compile.
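
A minimal sketch of this phase logging inside a Toolkit-style handler; the log_phase helper and the load_model / maybe_compile / run_inference functions are illustrative placeholders, not an HF API:

# Hedged sketch: phase logging so the endpoint logs show exactly where startup stalls.
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("startup")

@contextmanager
def log_phase(name):
    logger.info("PHASE_START %s", name)
    start = time.time()
    try:
        yield
    finally:
        logger.info("PHASE_END %s (%.1fs)", name, time.time() - start)

class EndpointHandler:
    def __init__(self, path=""):
        with log_phase("load_model"):
            self.model = load_model(path)           # placeholder for your loading code
        with log_phase("maybe_compile"):
            self.model = maybe_compile(self.model)  # placeholder, e.g. torch.compile behind a flag

    def __call__(self, data):
        return run_inference(self.model, data)      # placeholder for your inference code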

Step 4 – Toggle torch.compile and heavy work

  • Disable compile and extra warmup (via env flags) and redeploy:

    • If the endpoint finally reaches running, then reorganize how and when you compile.
  • If disabling compile doesn’t help, focus back on capacity, port mapping, and crash loops.


8. Summary: main causes and matching solutions

To make it extremely explicit:

  1. Capacity / scheduling issues (HF infra)

    • Symptom: “Endpoint failed to start. Scheduling failure: not enough hardware capacity.” (Hugging Face Forums)
    • Fix: Try another region / instance; contact HF with endpoint details.
  2. Load balancer / routing issues (HF infra)

    • Symptom: “Load balancer not ready yet” for a long time, multiple users affected. (Hugging Face Forums)
    • Fix: Treat as an incident; confirm with small endpoints; escalate to HF.
  3. Health-check / port mapping problems (custom images / servers)

    • Symptom: Logs show your app running, but status never leaves Initializing. (Hugging Face)
    • Fix: Ensure server listens on 0.0.0.0 on the expected port and exposes that port; do not start your own server inside Toolkit handlers.
  4. Crash loop during startup (OOM, dependency, config errors)

    • Symptom: Repeated stack traces or errors; sometimes explicit “Endpoint failed to start.” (Hugging Face Forums)
    • Fix: Read logs; adjust model size/dtype; fix imports and TGI/vLLM config; test image locally.
  5. Startup too slow due to heavy work and torch.compile

    • Symptom: Logs show long-running initialization, possibly with no error, and endpoint never becomes healthy; disabling compile or reducing work suddenly lets it start.

    • Background: torch.compile cold-start time is “seconds to minutes” for common models, and much longer for large models. (PyTorch Docs)

    • Fix:

      • Guard compile with env flags.
      • Move compile to the first real request instead of __init__.
      • Use mode="reduce-overhead" and compile only hot paths.
      • Limit warmup in the health-check window.

Hi @omarzouk55! This can happen if there’s low availability of the hardware selected and it’s not available to use just yet. If you’d like to get started right away, you can try another instance. As soon as an instance is available, your Endpoint will be up and running :rocket:
