Dedicated CPU Inference Endpoint returns empty HTTP 500 after ~80s: is there a configurable request timeout?


Environment

  • Product: Dedicated private Inference Endpoint (CPU, not serverless)
  • Region: eu-west-1
  • Framework: custom EndpointHandler (Python, SimpleITK)
  • Client: httpx with a 600s timeout

Problem

Requests to our dedicated CPU endpoint occasionally return an HTTP 500 with a completely empty response body. This only happens for computationally expensive requests; lighter requests on the same endpoint complete successfully.

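Our handler wraps all processing in a try/except that returns a structured JSON error on any Python exception. An empty 500 therefore suggests the response did not come from our handler: either the container process was killed before Python could write a response, or a layer in front of it returned the error.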
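A minimal sketch of this wrapping pattern, for illustration only. It is not the actual handler: `_process` and the shape of the error payload are hypothetical stand-ins, and only the `EndpointHandler` class with `__init__(path)` and `__call__(data)` follows the custom handler interface.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


class EndpointHandler:
    """Custom handler sketch; the real processing is replaced by a stub."""

    def __init__(self, path: str = "") -> None:
        self.path = path

    def _process(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Hypothetical placeholder for the real SimpleITK pipeline.
        if "inputs" not in data:
            raise ValueError("missing 'inputs'")
        return {"shape": [256, 256, 50]}

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        try:
            return self._process(data)
        except Exception as exc:
            # Any Python-level failure becomes a structured JSON error,
            # so the response body is never empty when this code runs.
            logger.exception("handler failed")
            return {"error": type(exc).__name__, "message": str(exc)}
```

With a wrapper like this, a Python exception always yields a JSON body, which is why an empty body points away from the handler code itself.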


What we investigated

Step 1 - ruled out OOM
We added per-step INFO logging inside the handler. The HF application logs show the container reaching the expensive computation step, then going silent. Monitoring the endpoint metrics shows CPU spiking to 100% and staying there, but memory remaining well under 500 MB (limit is 2 GB). OOM does not look like the cause.

Step 2 - ruled out client-side timeout
Our httpx client timeout is set to 600s. The 500 was received at ~165s, so the client timeout never fired. The error came from the server side.

Step 3 - SLEEP_TEST experiment
To isolate whether this is a wall-clock timeout imposed by HF infrastructure (rather than anything specific to our computation), we replaced the real processing with a simple sleep loop that logs a heartbeat every 10 seconds:

    # Runs inside the handler's __call__; needs `import time` and a configured logger.
    elapsed = 0
    while elapsed < 240:
        time.sleep(10)
        elapsed += 10
        logger.info("SLEEP_TEST: %ds / 240s elapsed", elapsed)
    return {"shape": [256, 256, 50]}

This was enabled via a SLEEP_TEST environment variable on the endpoint. Result:
the last log line we received was SLEEP_TEST: 80s / 240s elapsed. A 500 with
empty body was returned immediately after. The endpoint never reached the 90s
heartbeat.

This points to a hard wall-clock timeout of approximately 80–90 seconds imposed
by the infrastructure, unrelated to our code or the specific computation being
performed: a handler that only sleeps and logs is cut off at the same point.

Step 4 - verified the computation itself is not broken
We ran the same processing locally and it completed successfully in ~120s.


Question

Is there a configurable request timeout on dedicated CPU Inference Endpoints?
The ~80–90s hard kill appears to be imposed by a gateway or proxy layer (the
container process receives no signal we can intercept: there is no Python
exception, no SIGTERM handler triggered).
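For anyone wanting to reproduce the signal check, here is a sketch of SIGTERM logging. It assumes the module is imported in the main thread of the server process (`signal.signal` raises `ValueError` elsewhere) and that the serving framework does not install its own handler afterwards. `install_sigterm_logging` and `received_signals` are hypothetical names. SIGKILL cannot be caught, so a silent log is consistent with either a SIGKILL or no signal being sent at all.

```python
import logging
import os
import signal
from typing import List

logger = logging.getLogger("signal_probe")

# Signal numbers seen so far, kept so the result can be inspected later.
received_signals: List[int] = []


def _log_signal(signum, frame) -> None:
    # Record and log the signal; this does not terminate the process.
    received_signals.append(signum)
    logger.warning(
        "pid %d received %s", os.getpid(), signal.Signals(signum).name
    )


def install_sigterm_logging() -> None:
    # Replaces the default SIGTERM action (terminate) with logging only.
    signal.signal(signal.SIGTERM, _log_signal)
```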

If this limit is fixed for the CPU tier, is there a higher-tier option or
configuration that supports longer-running synchronous requests? Alternatively,
is there a recommended pattern for CPU-bound tasks that exceed this duration
(e.g. polling, async task queues)?
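
To make the polling idea concrete, here is a rough sketch of a submit-then-poll pattern. It assumes a background thread keeps running between requests and that all requests reach the same replica; neither is confirmed for Inference Endpoints. `JobStore` and `slow_task` are hypothetical names, and the in-memory registry is lost on restart.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict


class JobStore:
    """In-memory job registry: submit returns at once, status is polled."""

    def __init__(self, max_workers: int = 1) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, fn: Callable[..., Any], *args: Any) -> str:
        # Start the work in the background and hand back an id immediately.
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(fn, *args)
        return job_id

    def status(self, job_id: str) -> Dict[str, Any]:
        future = self._jobs.get(job_id)
        if future is None:
            return {"status": "unknown"}
        if not future.done():
            return {"status": "running"}
        exc = future.exception()
        if exc is not None:
            return {"status": "failed", "error": str(exc)}
        return {"status": "done", "result": future.result()}


def slow_task(seconds: float) -> Dict[str, Any]:
    # Stand-in for the expensive computation.
    time.sleep(seconds)
    return {"shape": [256, 256, 50]}
```

The first request would call `submit` and return the job id; later short requests would call `status` until it reports `done`, so no single request has to outlive the ~80s window.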

Hi @Jose-Verdu-Diaz! You can email api-enterprise@huggingface.co and we’ll be happy to investigate this issue for you. Thank you!

Thanks for the detailed investigation! Based on your findings, here are a few things to check that might resolve the 80s timeout:

  1. Adjust Request Timeout Settings:

    • If you call the endpoint through the huggingface_hub library, ensure the timeout parameter is set higher than 80s when initializing the InferenceClient.
    • Example: InferenceClient(model=..., timeout=120)
  2. Verify Container Resource Limits:

    • Although memory usage is low, confirm if the CPU cores allocated are sufficient for your workload. Sometimes CPU throttling can cause unexpected halts.
  3. Check Server-side Logs:

    • If possible, enable DEBUG level logs on the endpoint side to see if there is a silent exception being caught that isn’t visible in the standard 500 error message.

Hope this helps fix the empty 500 response issue!