Inquiry About 120s Timeout on Hugging Face Inference Endpoint for Llama 3.1-8B

I’m attempting to call the Hugging Face Inference API at:

https://api-inference.huggingface.co/models/meta-llama/Llama-3.1-8B

using the headers:

{
  "Authorization": "Bearer <MY_API_KEY>",
  "Content-Type": "application/json",
  "X-Use-Cache": "false",
  "X-Wait-For-Model": "true",
  "X-Inference-Provider": "cerebras"
}

and a high local timeout (300 s). However, my request frequently fails with a read timeout after about 120 seconds. Based on the logs, the connection is being closed server-side at the 120 s mark, regardless of my local timeout setting.
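For reference, here is a minimal sketch of how I'm making the call with the requests library; the prompt text and the max_new_tokens value are placeholders rather than my actual payload:

import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.1-8B"

headers = {
    "Authorization": "Bearer <MY_API_KEY>",
    "Content-Type": "application/json",
    "X-Use-Cache": "false",
    "X-Wait-For-Model": "true",
    "X-Inference-Provider": "cerebras",
}

# Placeholder payload; the real prompt is longer, which is why generation
# can run past 120 seconds.
payload = {
    "inputs": "Write a detailed summary of ...",
    "parameters": {"max_new_tokens": 2048},
}

# Local timeout is set to 300 s, but the read still fails at ~120 s.
response = requests.post(API_URL, headers=headers, json=payload, timeout=300)
response.raise_for_status()
print(response.json())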

Could you please confirm:

  1. Is there a 120s server-side limit for the Llama 3.1-8B model on my current plan (or in general)?

  2. Does the X-Inference-Provider: cerebras header support longer inference times, or is there a specific limit on that provider?

  3. Do I need a specific plan or configuration to allow requests that take more than 120 seconds to complete?

I’ve tried:

  • Lowering max_new_tokens

  • Reducing concurrency

  • Adding a (connect, read) tuple timeout like (60, 300) — sketched after this list

But the endpoint still times out after about 120 seconds with a read timeout. I'd appreciate any guidance on enabling longer-running requests, or clarification of the maximum time allocated to the inference endpoint.
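For completeness, this is the tuple-timeout variant mentioned in the list above, reusing the same API_URL and headers as in the earlier sketch; the reduced max_new_tokens value is just an example:

# timeout=(60, 300) means 60 s to connect and 300 s to read;
# the read is still cut off at roughly 120 s server-side.
payload = {
    "inputs": "Write a detailed summary of ...",
    "parameters": {"max_new_tokens": 512},
}
response = requests.post(API_URL, headers=headers, json=payload, timeout=(60, 300))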


Apparently, a discussion forum for Inference Providers was set up yesterday or today, so asking there might get a quicker answer.