I’m attempting to call the Hugging Face Inference API at:
https://api-inference.huggingface.co/models/meta-llama/Llama-3.1-8B
using the headers:
{
  "Authorization": "Bearer <MY_API_KEY>",
  "Content-Type": "application/json",
  "X-Use-Cache": "false",
  "X-Wait-For-Model": "true",
  "X-Inference-Provider": "cerebras"
}
and a high local timeout (300s). However, the request frequently fails with a read timeout at 120 seconds. Based on the logs, the connection is being closed server-side at 120s, regardless of my local timeout setting.
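For reference, this is roughly how I'm making the call (a minimal sketch; the prompt and generation parameters below are placeholders, not my real values):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.1-8B"
headers = {
    "Authorization": "Bearer <MY_API_KEY>",
    "Content-Type": "application/json",
    "X-Use-Cache": "false",
    "X-Wait-For-Model": "true",
    "X-Inference-Provider": "cerebras",
}
payload = {
    "inputs": "<PROMPT>",                    # placeholder prompt
    "parameters": {"max_new_tokens": 2048},  # illustrative value
}

# Local timeout is 300s, but the read still fails at ~120s.
response = requests.post(API_URL, headers=headers, json=payload, timeout=300)
response.raise_for_status()
print(response.json())
```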
Could you please confirm:

- Is there a 120s server-side limit for the Llama 3.1-8B model on my current plan (or in general)?
- Does the `X-Inference-Provider: cerebras` header support longer inference times, or is there a specific limit for that provider?
- Do I need a specific plan or configuration to allow requests that take more than 120 seconds to complete?
I’ve tried:

- Lowering `max_new_tokens`
- Reducing concurrency
- Adding a tuple timeout like `(60, 300)` (see the snippet at the end of this post)

But the endpoint still times out after about 120 seconds with a read timeout.
I’d appreciate any guidance on enabling longer-running requests, or clarification of the maximum time allocated to the inference endpoint.
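For completeness, here is the tuple-timeout variant mentioned above (60s connect, 300s read), reusing the same `API_URL`, `headers`, and `payload` from the snippet earlier in this post:

```python
# Separate connect/read timeouts; the read side still times out at ~120s.
response = requests.post(API_URL, headers=headers, json=payload, timeout=(60, 300))
```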