My calls to Deepseek-R1 sometimes time out. This timeout is weakly correlated with the prompt size and is nondeterministic. Increasing max_tokens didn’t seem to help. I would like to know what information I can get to help triage this timeout. I do have a paid subscription.
Which endpoint caused the error…?
Here is a concrete way to triage your DeepSeek-R1 timeouts, step by step.
0. Background: what “Gateway Time-out” really means here
When you call DeepSeek-R1, the actual path is usually:
Your code → HTTP client → (optional) your API server / proxy / serverless function → (optional) provider (SiliconFlow, OpenRouter, NIM, Azure, etc.) → DeepSeek backend
A 504 Gateway Timeout (or similar “request timed out”) almost always means:
- Some gateway in the middle waited too long for its upstream and gave up,
- Not that "the model hit max_tokens" or that your request was syntactically invalid.
General HTTP background: 504 is a server-side error that occurs when a gateway/proxy doesn’t get a timely response from the upstream server; the client is usually not the root cause.(Kinsta®)
For DeepSeek-R1 specifically:
- DeepSeek’s own docs emphasize non-streaming is default and slower; streaming is recommended for better interactivity.(api-docs.deepseek.com)
- SiliconFlow (a common R1 provider) explicitly warns that in non-stream requests, long output is prone to 504 timeout issues, and recommends streaming + longer client timeouts.(docs.siliconflow.cn)
- Proxies and orchestrators (Dify, LiteLLM, etc.) have concrete bugs/issues where they, not DeepSeek, emit 504 + HTML error pages when long R1 calls run too long under load.(GitHub)
Your description:
- Weak correlation with prompt size,
- Nondeterministic failures,
- Increasing max_tokens doesn't help,
fits very well with this pattern: some outer time limit + variable backend latency + queuing under load.
So the triage goal is:
“Find out which layer is timing out, and under what conditions.”
Everything below is structured around that.
1. Minimum useful data per request (what to log)
To triage timeouts, capture the same structured information for every request, both successful and failed. For 504s this data is extremely valuable.
1.1 Request metadata
For each call (especially failures), log:
- Timestamp: ISO 8601 with timezone (preferably UTC).
- Environment: e.g., "prod / us-east-1 / Cloudflare → OpenRouter → DeepSeek", or "local → api.deepseek.com".
- Target endpoint:
  - Host (e.g. api.deepseek.com, api.openrouter.ai, api.siliconflow.cn, or your own domain).
  - Path (e.g. /v1/chat/completions).
- Model: exact name: deepseek-ai/DeepSeek-R1, deepseek-r1, deepseek-r1-distill-*, etc.
- Key parameters:
  - stream flag (true/false).
  - max_tokens (or max_completion_tokens).
  - temperature, top_p, top_k (just for reproducibility).
- Prompt size estimate:
  - Approximate input token count (using a local tokenizer for R1/V3; doesn't need to be exact).
  - A short summary string of what the prompt is doing (e.g., "large code review", "multi-page PDF analysis").
This is enough to later slice the data by streaming vs non-streaming, token count, or environment.
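If you want this in code, here is a minimal sketch of building the request-side half of such a log entry in Python, assuming an OpenAI-style chat-completions payload. The name build_request_meta and the chars/4 token estimate are illustrative placeholders, not anything from a DeepSeek SDK; swap in a real tokenizer if you have one.

# Minimal sketch: capture request-side metadata before each call.
from datetime import datetime, timezone

def build_request_meta(endpoint: str, model: str, payload: dict,
                       env: str, concurrency: int, prompt_summary: str) -> dict:
    # Rough prompt size estimate; assumes plain-text message contents.
    prompt_text = "".join(m.get("content", "") for m in payload.get("messages", []))
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "env": env,                                      # e.g. "prod-us-east-1 -> OpenRouter -> DeepSeek"
        "endpoint": endpoint,
        "model": model,
        "stream": payload.get("stream", False),
        "max_tokens": payload.get("max_tokens"),
        "temperature": payload.get("temperature"),
        "input_tokens_estimate": len(prompt_text) // 4,  # chars/4 heuristic, not an exact token count
        "concurrency_at_start": concurrency,
        "prompt_summary": prompt_summary,                # e.g. "large code review"
    }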
1.2 Response metadata
When a timeout happens, the shape of the response tells you which layer failed.
Log:
- HTTP status and reason: 504 Gateway Timeout, 504 Gateway Time-out, or the client-side timeout exception text.
- Response body:
  - If it's HTML (starts with <!DOCTYPE html> or <html>), that strongly indicates a gateway/proxy error (e.g., LiteLLM proxy, Dify cloud, serverless platform). This is exactly what Dify and LiteLLM issues show for DeepSeek-related 504s.(GitHub)
  - If it's JSON, check whether:
    - It looks like a provider error format (e.g., OpenAI-style {"error": { "code": ... }}), or
    - It looks like a DeepSeek/aggregator JSON error (codes like 400–503, but not 504).
  - If there is no response body at all (connection closed / client timeout), record that too.
- Response headers:
  - Basic headers: Date, Server.
  - Proxy headers: Via, CF-*, X-Forwarded-*.
  - Provider IDs: x-request-id, x-trace-id, x-openrouter-id, x-siliconflow-request-id, x-azure-ref, etc.
These headers and body type are what let you say “this came from my proxy” vs “this came from the provider” vs “this came from DeepSeek.”
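Here is a minimal sketch of capturing that response-side information with the Python requests library. The header list below just mirrors the examples above; which of them actually appear depends on your provider and proxies, so treat it as a starting point rather than a definitive set.

# Sketch: classify the response body shape and keep only the interesting headers.
from typing import Optional
import requests

INTERESTING_HEADERS = {"date", "server", "via", "cf-ray", "x-forwarded-for",
                       "x-request-id", "x-trace-id"}

def capture_response_meta(resp: Optional[requests.Response],
                          error: Optional[Exception] = None) -> dict:
    if resp is None:
        # No HTTP response at all: connection closed or client-side timeout.
        return {"status": None, "body_type": "none",
                "error": repr(error), "response_headers": {}}
    snippet = resp.text[:300]
    lowered = snippet.lstrip().lower()
    if lowered.startswith(("<!doctype", "<html")):
        body_type = "html"      # almost always a gateway/proxy error page
    elif lowered.startswith("{"):
        body_type = "json"      # structured provider/DeepSeek error
    else:
        body_type = "other"
    return {
        "status": resp.status_code,
        "reason": resp.reason,
        "body_type": body_type,
        "body_snippet": snippet,
        "response_headers": {k: v for k, v in resp.headers.items()
                             if k.lower() in INTERESTING_HEADERS},
    }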
1.3 Timing data
For each request, log:
- Client-side timeout settings:
  - Connect timeout.
  - Read / response timeout.
- Measured durations:
  - t_start → t_first_byte (time until first response byte/event).
  - t_first_byte → t_last_byte (time to stream or receive the rest).
  - Total time when the error was raised.
Then you can answer:
- Do 504s cluster around 60 seconds (typical serverless/gateway limit)?
- Around 120 seconds (Azure-style / Dify-style limits)?(GitHub)
- Around 180 seconds (some cloud gateways)?(BytePlus)
- Or shorter (e.g., your HTTP client default)?
BytePlus’ DeepSeek-R1 timeout guide explicitly distinguishes client-side timeout, gateway timeout, and model inference time, and recommends tracking per-request latency to see which one you’re hitting.(BytePlus)
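Here is a minimal instrumented call in Python, assuming an OpenAI-compatible /v1/chat/completions endpoint reachable with the requests library. The 10 s connect / 300 s read timeouts are deliberately generous placeholders so the measurement, rather than the client, decides when to give up.

# Sketch: one timed request that records the three durations described above.
import time
import requests

def timed_call(url: str, headers: dict, payload: dict,
               connect_timeout: float = 10.0, read_timeout: float = 300.0) -> dict:
    timing = {"time_to_first_byte_ms": None, "time_to_last_byte_ms": None,
              "total_time_ms": None, "status": None}
    t_start = time.monotonic()
    try:
        with requests.post(url, headers=headers, json=payload,
                           timeout=(connect_timeout, read_timeout), stream=True) as resp:
            t_first = None
            for _chunk in resp.iter_content(chunk_size=None):
                if t_first is None:
                    t_first = time.monotonic()
                    timing["time_to_first_byte_ms"] = (t_first - t_start) * 1000
            if t_first is not None:
                timing["time_to_last_byte_ms"] = (time.monotonic() - t_first) * 1000
            timing["status"] = resp.status_code
    except requests.exceptions.RequestException as exc:
        timing["error"] = repr(exc)   # connect/read timeouts and dropped connections land here
    timing["total_time_ms"] = (time.monotonic() - t_start) * 1000
    return timing

The later sketches in this answer reuse this hypothetical timed_call() helper.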
2. Step-by-step triage process
Once basic logging is in place, use a structured process so you can narrow down the cause.
Step 1 – Identify which layer is producing the timeout
Using the response shape + headers:
- HTML 504 page (gateway error)
  - HTML body with a generic 504 page, often with a server name or brand (e.g., Dify's api.dify.ai HTML 504, LiteLLM's HTML 504).(GitHub)
  - This almost certainly means some proxy / platform timed out before DeepSeek finished.
- JSON error from provider
  - Provider-specific JSON (OpenRouter, SiliconFlow, Azure, etc.) with error codes like 429, 500, 503.
  - Here, the timeout may be inside their system, but you now have a structured error, plus request IDs to send to their support.
- No HTTP response (client timeout)
  - Your HTTP client gave up before any HTTP status was received.
  - In this case, your own timeout setting is too short for worst-case R1 latency.
Combine this with the host you’re calling (DeepSeek directly vs aggregator vs your own API) to decide which support team or configuration to look at.
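Expressed as a rough decision helper over the flattened log fields from the sketches in section 1 (the labels are mine, not an official taxonomy):

# Heuristic mapping from (status, body_type) to the layer that most likely timed out.
def classify_timeout(entry: dict) -> str:
    if entry.get("status") is None:
        return "client-timeout"      # your own HTTP client gave up first
    if entry.get("status") == 504 and entry.get("body_type") == "html":
        return "gateway-or-proxy"    # HTML 504 page from some middle layer
    if entry.get("body_type") == "json":
        return "provider-api"        # structured error; include x-request-id when contacting support
    return "unknown"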
Step 2 – Compare streaming vs non-streaming
DeepSeek’s own docs: API defaults to non-stream, but web UI uses streaming; streaming improves interactivity and often perceived speed.(api-docs.deepseek.com)
SiliconFlow explicitly: “In non-stream requests, long output content is prone to 504 timeout issues” for DeepSeek R1-series; they recommend streaming and longer client timeout.(docs.siliconflow.cn)
So for triage:
- Take a prompt that sometimes times out.
- Run it:
  - Non-stream (stream=false).
  - Stream (stream=true), using the same model and similar max_tokens.
- Compare:
  - Does streaming succeed much more reliably?
  - Does the time to first token improve significantly?
If streaming is much more reliable, you have strong evidence that long total response time is hitting a gateway limit, not a DeepSeek “max tokens” or syntactic issue.
Several systems around DeepSeek show exactly this pattern:
- LiteLLM issue: long non-stream requests through the proxy result in 504, but the same prompts succeed with streaming.(GitHub)
- SiliconFlow FAQ for DeepSeek: specifically calls out non-stream R1 responses as prone to 504.(docs.siliconflow.cn)
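A small A/B sketch of that comparison, reusing the hypothetical timed_call() helper from section 1.3; run each mode several times, since the failures are nondeterministic.

# Sketch: same prompt, stream=false vs stream=true, a handful of runs each.
def compare_modes(url: str, headers: dict, base_payload: dict, runs: int = 5) -> dict:
    results = {"non_stream": [], "stream": []}
    for _ in range(runs):
        for key, stream in (("non_stream", False), ("stream", True)):
            payload = dict(base_payload, stream=stream)   # copy payload, toggle streaming
            results[key].append(timed_call(url, headers, payload))
    return results

If results["stream"] stays clean while results["non_stream"] keeps returning 504 or timing out, that is the gateway-duration pattern described above.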
Step 3 – Vary prompt size under controlled concurrency
You already observed “weak correlation with prompt size.” To quantify that:
- Fix concurrency (e.g., only 1–2 parallel requests).
- Run a load test with:
  - Small prompts (a few hundred tokens).
  - Medium prompts (2–4k tokens).
  - Large prompts (8k+ tokens, if allowed).
- For each bucket, record:
  - Success rate vs 504 rate.
  - Median and p95/p99 latencies.
If larger prompts significantly increase latency and 504 rate, but not in a perfectly linear way, that is typical for “heavy model + dynamic queuing + fixed timeout” behavior, which matches reports from NIM, OpenRouter, and Reddit users: sometimes DeepSeek-based calls are fast, sometimes they stall and time out under load.(NVIDIA Developer Forums)
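A sketch of that sweep, again reusing timed_call(); make_prompt() is a placeholder for however you generate a test prompt of roughly the target size, and the bucket sizes mirror the list above.

# Sketch: prompt-size sweep at fixed, low concurrency (runs sequentially).
import statistics

def sweep_prompt_sizes(url, headers, base_payload, make_prompt, runs_per_bucket=10):
    buckets = {"small": 300, "medium": 3000, "large": 8000}   # approx. input tokens
    report = {}
    for name, target_tokens in buckets.items():
        payload = dict(base_payload,
                       messages=[{"role": "user", "content": make_prompt(target_tokens)}])
        outcomes = [timed_call(url, headers, payload) for _ in range(runs_per_bucket)]
        times = sorted(o["total_time_ms"] for o in outcomes)
        report[name] = {
            "timeout_or_504_rate": sum(o["status"] in (None, 504) for o in outcomes) / runs_per_bucket,
            "median_ms": statistics.median(times),
            "p95_ms": times[int(0.95 * (runs_per_bucket - 1))],
        }
    return report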
Step 4 – Vary concurrency
DeepSeek and some providers apply dynamic rate limits and queuing depending on traffic, even for paid plans. External guides and community reports describe increased timeouts under bursts or high concurrency, not just large prompts.(BytePlus)
For triage:
- Keep the prompt content fixed.
- Run:
  - Low concurrency (1–2 in flight).
  - Moderate concurrency (5–10 in flight).
  - High concurrency (whatever your real workload is).
- For each, measure:
  - 504 frequency.
  - Latencies.
  - Any rate-limit or "server busy" style messages.
If 504s appear mainly at higher concurrency, that suggests:
- Your provider or gateway is hitting internal limits and letting requests queue long enough to time out, or
- DeepSeek’s back-end slot allocation to your account is saturated and some requests get deprioritized, surfacing as timeouts in outer layers (exactly what some timeout-focused guides for R1 discuss).(BytePlus)
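A sketch of the concurrency sweep, using a thread pool to bound how many requests are in flight at once; the levels and call counts are just the examples above, again reusing timed_call().

# Sketch: fixed payload, varying number of in-flight requests.
import statistics
from concurrent.futures import ThreadPoolExecutor

def sweep_concurrency(url, headers, payload, levels=(1, 5, 10), calls_per_level=20):
    report = {}
    for level in levels:
        with ThreadPoolExecutor(max_workers=level) as pool:
            outcomes = list(pool.map(lambda _: timed_call(url, headers, payload),
                                     range(calls_per_level)))
        report[level] = {
            "timeout_or_504_rate": sum(o["status"] in (None, 504) for o in outcomes) / calls_per_level,
            "median_ms": statistics.median(o["total_time_ms"] for o in outcomes),
        }
    return report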
Step 5 – Correlate timing with common gateway limits
Use the timing data:
- If almost all 504s happen at ~60 seconds, suspect:
  - A serverless function limit (e.g., Vercel/Cloudflare function),
  - A default HTTP client timeout.
- If they cluster around ~120 seconds, this resembles the Azure-style / Dify-style service-level timeouts mentioned above.(GitHub)
- If they happen at irregular times and you also see non-timeout errors (429, 5xx from the provider), capacity issues or temporary incidents are more likely; OpenRouter, NIM, and others show these patterns during DeepSeek-related load spikes.(NVIDIA Developer Forums)
This gives your provider something concrete to check in their logs.
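Given a list of the flattened log entries produced by the earlier sketches, clustering failure durations around those limits takes only a few lines; the 5-second tolerance is an arbitrary choice.

# Sketch: count how many failures land near each well-known timeout value.
from collections import Counter

def cluster_failures(entries, limits_s=(60, 120, 180), tolerance_s=5):
    counts = Counter()
    for e in entries:
        if e.get("status") not in (None, 504):
            continue                                  # only look at timeouts / 504s
        seconds = e.get("total_time_ms", 0) / 1000
        label = next((f"~{limit}s" for limit in limits_s
                      if abs(seconds - limit) <= tolerance_s), "other")
        counts[label] += 1
    return counts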
4. Example of a simple “timeout triage log entry”
You do not need to use this exact format, but a structured log like this per request is ideal:
// Example telemetry for one DeepSeek-R1 call
{
"timestamp_utc": "2025-11-20T10:34:12Z",
"env": "prod-us-east-1",
"endpoint": "https://api.deepseek.com/v1/chat/completions",
"model": "deepseek-ai/DeepSeek-R1",
"stream": false,
"max_tokens": 4096,
"temperature": 0.6,
"input_tokens": 6200,
"concurrency_at_start": 7,
"client_timeouts": {
"connect_timeout_s": 10,
"read_timeout_s": 60
},
"timing": {
"time_to_first_byte_ms": null,
"time_to_last_byte_ms": null,
"total_time_ms": 60050
},
"http": {
"status": 504,
"reason": "Gateway Time-out",
"body_type": "html", // or "json" / "none"
"body_snippet": "<!DOCTYPE html><html>...",
"response_headers": {
"Server": "my-proxy",
"Date": "Thu, 20 Nov 2025 10:35:12 GMT",
"x-request-id": "abcd-1234"
}
}
}
With a few of these from different scenarios (stream vs non-stream, small vs large prompt, low vs high concurrency), you can systematically answer:
- Which layer is sending 504?
- At what time limits?
- Under what workloads?
5. Short, practical triage checklist
You can treat this as an operational runbook:
- Add structured logging for every DeepSeek-R1 call:
  - Request metadata (model, stream, tokens, concurrency).
  - Response status / body type / headers.
  - Timings.
- Classify 504s:
  - HTML response ⇒ gateway/proxy/platform.
  - JSON error ⇒ provider / API-level issue.
  - No response ⇒ your own client timeout.
- Test streaming vs non-stream:
  - Same prompt, both modes.
  - If streaming is stable while non-stream requests hit 504s, you are hitting a response-duration limit in some gateway.(docs.siliconflow.cn)
- Test prompt size under fixed concurrency:
  - Small / medium / large. Log latency and 504 rate.
- Test concurrency under fixed prompt:
  - Scale from 1 → N parallel calls.
  - Note how latency and 504 rate change.
- Compare timing to known limits:
  - ~60 s → likely serverless / HTTP client limit.
  - ~120–180 s → likely provider/gateway limit (Dify, Azure, BytePlus examples).(BytePlus)
- Prepare a support bundle:
  - 2–5 representative failing requests with full metadata.
  - Aggregated stats (504 percentage, correlation notes).
- Implement basic mitigations while triaging (see the retry sketch right after this list):
  - Enable streaming for heavy prompts where possible.(api-docs.deepseek.com)
  - Increase the client read timeout to exceed typical model latency.(BytePlus)
  - Reduce burst concurrency or add retry with exponential backoff.(BytePlus)
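As one possible shape for that last point, here is a retry wrapper with exponential backoff and jitter; the attempt count, delays, and the 10 s / 300 s timeouts are placeholders to tune for your workload.

# Sketch: retry only on timeout-ish failures, backing off exponentially with jitter.
import random
import time
import requests

def call_with_backoff(url, headers, payload, max_attempts=4, base_delay_s=2.0):
    for attempt in range(max_attempts):
        try:
            resp = requests.post(url, headers=headers, json=payload, timeout=(10, 300))
            if resp.status_code not in (429, 502, 503, 504):
                return resp                          # success or a non-retryable error
        except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):
            pass                                     # treat a client timeout like a 504
        if attempt < max_attempts - 1:
            time.sleep(base_delay_s * (2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError("DeepSeek call still failing after retries")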
6. Curated external references (with why they are useful)
A few focused links that line up with your situation:
A. DeepSeek-specific timeout and 504 discussions
- BytePlus – "API request timeout for DeepSeek-R1 model"
  Practical guide to distinguishing client vs server vs gateway timeouts when using DeepSeek-R1, with recommendations on timeouts, retries, and monitoring.(BytePlus)
- SiliconFlow FAQ – DeepSeek R1 series & 504
  Explicitly states that non-stream DeepSeek-R1 calls with long outputs are prone to 504, and recommends using streaming and tuning client timeouts. Good confirmation of the streaming vs non-stream pattern.(docs.siliconflow.cn)
- Reddit – "DeepSeek API: Every Request Is A Timeout :("
  Real-world report of repeated timeouts with DeepSeek APIs and discussion of switching providers/fallbacks when DeepSeek is slow or overloaded. Useful for understanding behavior under heavy public load.(Reddit)
B. Proxy / platform 504 case studies
- GitHub – Dify "DeepSeek Plugin – 504 Gateway Timeout Error"
  Shows what a 504 looks like when the error originates in a plugin/gateway (HTML response, not JSON), and how it can mask underlying success. Good for recognizing this pattern in your own logs.(GitHub)
- GitHub – LiteLLM "504 Gateway Time-out … Long Non-Streaming Requests"
  DeepSeek-related 504s from a proxy, fixed by switching to streaming. Strong validation that streaming vs non-stream is a key dimension to toggle.(GitHub)
- NVIDIA Developer Forum – 504s with DeepSeek on NIM
  Shows gateway timeouts and slower inference with DeepSeek models under platform load, plus the effect of queue depths and incidents. Good background on provider-side constraints.(NVIDIA Developer Forums)
C. General HTTP / DeepSeek R1 troubleshooting
- GeeksforGeeks – "How to Fix HTTP Request DeepSeek R1 Issue"
  Beginner-friendly overview of typical DeepSeek-R1 HTTP issues (timeouts, network problems, configuration mistakes) and suggested steps to resolve them. Good for broad background.(GeeksforGeeks)
- Kinsta – "504 Gateway Timeout error" explainer
  Clear explanation of what 504 means in HTTP terms: a gateway timed out waiting for another server. Useful base knowledge for your mental model.(Kinsta®)
These external references line up with the triage steps above and show that your symptoms (nondeterministic timeouts, weak correlation with prompt size, max_tokens not helping) are consistent with how DeepSeek-R1 behaves when it runs into outer time limits and platform load, rather than a simple “you set max_tokens too low” problem.