Environment
- Product: Dedicated private Inference Endpoint (CPU, not serverless)
- Region: eu-west-1
- Framework: custom EndpointHandler (Python, SimpleITK)
- Client: httpx with a 600s timeout
Problem
Requests to our dedicated CPU endpoint occasionally return an HTTP 500 with a completely empty response body. This only happens for computationally expensive requests; lighter requests on the same endpoint complete successfully.
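Our handler wraps all processing in a try/except that returns a structured JSON error on any Python exception. An empty 500 therefore did not come from our handler code: either the container process was killed before Python could write a response, or a layer in front of the container produced the error.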
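For context, the wrapper is structured roughly as follows. This is a simplified sketch, not our actual handler: `run_pipeline` is a hypothetical stand-in for the SimpleITK processing, and the field names in the error payload are illustrative.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def run_pipeline(inputs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for the real SimpleITK processing."""
    if inputs is None:
        raise ValueError("no inputs provided")
    return {"shape": [256, 256, 50]}


class EndpointHandler:
    def __init__(self, path: str = ""):
        # Resource loading would happen here; nothing is needed for the sketch.
        self.path = path

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        try:
            logger.info("step: start processing")
            result = run_pipeline(data.get("inputs"))
            logger.info("step: processing finished")
            return result
        except Exception as exc:
            # Any Python-level failure is turned into a structured JSON body,
            # so this path should never yield a response with an empty body.
            logger.exception("processing failed")
            return {"error": type(exc).__name__, "message": str(exc)}
```

With this shape, a Python exception always yields a JSON body, which is why an empty body points outside the handler.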
What we investigated
Step 1 - ruled out OOM
We added per-step INFO logging inside the handler. The HF application logs show the container reaching the expensive computation step, then going silent. Monitoring the endpoint metrics shows CPU spiking to 100% and staying there, but memory remaining well under 500 MB (limit is 2 GB). OOM does not look like the cause.
Step 2 - ruled out client-side timeout
Our httpx client timeout is set to 600s. The 500 was received at ~165s, so the client timeout never fired. The error came from the server side.
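The duration was measured on the client roughly as sketched below. The helper names `timed_call` and `failure_origin` are illustrative, not part of httpx; the only value taken from our setup is the 600s timeout.

```python
import time
from typing import Any, Callable, Tuple

CLIENT_TIMEOUT_S = 600.0  # matches the timeout configured on our httpx client


def timed_call(send: Callable[[], Any]) -> Tuple[Any, float]:
    """Run `send` and return its result plus the wall-clock duration in seconds."""
    start = time.monotonic()
    result = send()
    return result, time.monotonic() - start


def failure_origin(elapsed_s: float, client_timeout_s: float = CLIENT_TIMEOUT_S) -> str:
    """Classify a failed request by comparing its duration to the client timeout."""
    return "client timeout" if elapsed_s >= client_timeout_s else "server side"
```

In our case the call was equivalent to `timed_call(lambda: httpx.post(url, json=payload, timeout=600.0))`, where `url` and `payload` are placeholders. A failure at ~165s is classified as "server side", since it is far below the 600s limit.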
Step 3 - SLEEP_TEST experiment
To isolate whether this is a wall-clock timeout imposed by HF infrastructure (rather than anything specific to our computation), we replaced the real processing with a simple sleep loop that logs a heartbeat every 10 seconds:
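    # Temporary replacement for the real processing inside the handler
    # (requires `import time` and a module-level `logger`).
    elapsed = 0
    while elapsed < 240:
        time.sleep(10)
        elapsed += 10
        logger.info("SLEEP_TEST: %ds / 240s elapsed", elapsed)
    # Dummy payload returned once the loop completes.
    return {"shape": [256, 256, 50]}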
This was enabled via a SLEEP_TEST environment variable on the endpoint. Result: the last log line we received was "SLEEP_TEST: 80s / 240s elapsed". A 500 with an empty body was returned immediately after. The endpoint never reached the 90s heartbeat.
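This strongly suggests a hard wall-clock timeout of approximately 80–90 seconds imposed by the infrastructure, unrelated to our code or the specific computation being performed: the sleep loop uses almost no CPU or memory, so resource exhaustion cannot explain the failure.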
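The 80–90s bound follows directly from the heartbeat spacing. As a small sanity check, assuming no log lines were lost or delayed on their way to the log viewer (`cutoff_window` is just an illustrative helper):

```python
from typing import Tuple


def cutoff_window(last_heartbeat_s: int, interval_s: int) -> Tuple[int, int]:
    """The request was alive at the last heartbeat and ended before the next one."""
    return last_heartbeat_s, last_heartbeat_s + interval_s


print(cutoff_window(80, 10))  # (80, 90)
```

A shorter heartbeat interval would narrow this window if a more precise figure is useful.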
Step 4 - verified the computation itself is not broken
We ran the same processing locally and it completed successfully in ~120s.
Question
Is there a configurable request timeout on dedicated CPU Inference Endpoints?
The ~80–90s hard kill appears to be imposed by a gateway or proxy layer (the
container process receives no signal we can intercept: there is no Python
exception, no SIGTERM handler triggered).
If this limit is fixed for the CPU tier, is there a higher-tier option or
configuration that supports longer-running synchronous requests? Alternatively,
is there a recommended pattern for CPU-bound tasks that exceed this duration
(e.g. polling, async task queues)?
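To make the polling option concrete, this is roughly what we have in mind: one short request submits the job and later short requests poll for the result. It is only a sketch under strong assumptions: `submit` and `poll` are hypothetical names, and the in-memory job table would be lost on a container restart and would not be shared across replicas.

```python
import threading
import uuid
from typing import Any, Callable, Dict

_jobs: Dict[str, Dict[str, Any]] = {}
_lock = threading.Lock()


def submit(task: Callable[[], Any]) -> str:
    """Start `task` in a background thread and return a job id immediately."""
    job_id = uuid.uuid4().hex
    with _lock:
        _jobs[job_id] = {"status": "running", "result": None}

    def _run() -> None:
        try:
            outcome = {"status": "done", "result": task()}
        except Exception as exc:
            outcome = {"status": "failed", "result": str(exc)}
        with _lock:
            _jobs[job_id] = outcome

    threading.Thread(target=_run, daemon=True).start()
    return job_id


def poll(job_id: str) -> Dict[str, Any]:
    """Return a copy of the job's current state, or an 'unknown' status."""
    with _lock:
        return dict(_jobs.get(job_id, {"status": "unknown", "result": None}))
```

Each request would then finish well within the observed limit, at the cost of the client having to poll until the status is no longer "running".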