My inference endpoint went from 1 second to 20-30 seconds, even example

Aguidusername · February 24, 2025, 10:31pm

I'm not sure how I could go to production, or even give a demo, with latency this unreliable.

Same roberta-base-squad2-ect on
AWS us-east-1
CPU · Intel Sapphire Rapids · 1x vCPU · 2 GB

Feb 23rd

Feb 23, 22:28:12	INFO	2025-02-23 22:28:12 - huggingface_inference_toolkit - - POST /	Duration: 2097.84 ms
Feb 23, 22:29:02	INFO	2025-02-23 22:29:02 - huggingface_inference_toolkit - - POST /	Duration: 1577.03 ms
Feb 23, 22:29:04	INFO	2025-02-23 22:29:04 - huggingface_inference_toolkit - - POST /	Duration: 1662.39 ms
Feb 23, 22:30:00	INFO	2025-02-23 22:30:00 - huggingface_inference_toolkit - - POST /	Duration: 2195.06 ms
Feb 23, 22:30:02	INFO	2025-02-23 22:30:02 - huggingface_inference_toolkit - - POST /	Duration: 2127.71 ms
Feb 23, 22:36:23	INFO	2025-02-23 22:36:23 - huggingface_inference_toolkit - - POST /	Duration: 2071.23 ms
Feb 23, 22:36:25	INFO	2025-02-23 22:36:25 - huggingface_inference_toolkit - - POST /	Duration: 1929.39 ms
Feb 23, 22:37:36	INFO	2025-02-23 22:37:36 - huggingface_inference_toolkit - - POST /	Duration: 1264.34 ms
Feb 23, 22:37:37	INFO	2025-02-23 22:37:37 - huggingface_inference_toolkit - - POST /	Duration: 1226.00 ms

Feb 24

Feb 24, 22:15:28	INFO	2025-02-24 22:15:28 - huggingface_inference_toolkit - - POST /	Duration: 31006.63 ms
Feb 24, 22:19:04	INFO	2025-02-24 22:19:04 - huggingface_inference_toolkit - - POST /	Duration: 21957.62 ms
Feb 24, 22:24:56	INFO	2025-02-24 22:24:56 - huggingface_inference_toolkit - - POST /	Duration: 21581.66 ms
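
To put a number on the slowdown, the `Duration:` values in the log lines above can be grouped by day and compared. The following is only a minimal sketch: it assumes the log lines are pasted in as plain text (whitespace normalized here), and `durations_by_day` and its regex are hypothetical helpers written for this post, not part of `huggingface_inference_toolkit`.

```python
import re
from collections import defaultdict
from statistics import median

# Log lines copied from this post (tabs replaced by single spaces).
LOG_TEXT = """\
Feb 23, 22:28:12 INFO 2025-02-23 22:28:12 - huggingface_inference_toolkit - - POST / Duration: 2097.84 ms
Feb 23, 22:29:02 INFO 2025-02-23 22:29:02 - huggingface_inference_toolkit - - POST / Duration: 1577.03 ms
Feb 23, 22:29:04 INFO 2025-02-23 22:29:04 - huggingface_inference_toolkit - - POST / Duration: 1662.39 ms
Feb 23, 22:30:00 INFO 2025-02-23 22:30:00 - huggingface_inference_toolkit - - POST / Duration: 2195.06 ms
Feb 23, 22:30:02 INFO 2025-02-23 22:30:02 - huggingface_inference_toolkit - - POST / Duration: 2127.71 ms
Feb 23, 22:36:23 INFO 2025-02-23 22:36:23 - huggingface_inference_toolkit - - POST / Duration: 2071.23 ms
Feb 23, 22:36:25 INFO 2025-02-23 22:36:25 - huggingface_inference_toolkit - - POST / Duration: 1929.39 ms
Feb 23, 22:37:36 INFO 2025-02-23 22:37:36 - huggingface_inference_toolkit - - POST / Duration: 1264.34 ms
Feb 23, 22:37:37 INFO 2025-02-23 22:37:37 - huggingface_inference_toolkit - - POST / Duration: 1226.00 ms
Feb 24, 22:15:28 INFO 2025-02-24 22:15:28 - huggingface_inference_toolkit - - POST / Duration: 31006.63 ms
Feb 24, 22:19:04 INFO 2025-02-24 22:19:04 - huggingface_inference_toolkit - - POST / Duration: 21957.62 ms
Feb 24, 22:24:56 INFO 2025-02-24 22:24:56 - huggingface_inference_toolkit - - POST / Duration: 21581.66 ms
"""

# Capture the leading "Mon DD" stamp and the duration in milliseconds.
LINE_RE = re.compile(r"^(\w{3} \d{1,2}),.*Duration:\s*([\d.]+)\s*ms")


def durations_by_day(text: str) -> dict[str, list[float]]:
    """Group request durations (ms) by the day stamp at the start of each line."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            grouped[match.group(1)].append(float(match.group(2)))
    return dict(grouped)


by_day = durations_by_day(LOG_TEXT)
medians = {day: median(values) for day, values in by_day.items()}
slowdown = medians["Feb 24"] / medians["Feb 23"]

for day, value in medians.items():
    print(f"{day}: median {value:.2f} ms over {len(by_day[day])} requests")
print(f"Feb 24 vs Feb 23 median slowdown: {slowdown:.1f}x")
```

The median is used rather than the mean so that a single slow request (such as the 31-second one) does not dominate the comparison. With only three requests on Feb 24 the figure is indicative at best.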

John6666 · February 25, 2025, 3:33am

These appear to be different libraries, so there may be no direct relationship, but I found a similar issue involving endpoints. I think it is still unresolved…
There may be some kind of latent bug.

github.com/huggingface/text-generation-inference

Different inference results and speed between /generate and OpenAI endpoint

opened 07:22PM - 14 Nov 24 UTC

jegork

### System Info

Running docker image version 2.4.0 with eetq quantization

Model: microsoft/Phi-3.5-mini-instruct

```
{"model_id":"microsoft/Phi-3.5-mini-instruct","model_sha":"af0dfb8029e8a74545d0736d30cb6b58d2f0f3f0","model_pipeline_tag":"text-generation","max_concurrent_requests":128,"max_best_of":2,"max_stop_sequences":4,"max_input_tokens":2048,"max_total_tokens":4096,"validation_workers":2,"max_client_batch_size":4,"router":"text-generation-router","version":"2.4.0","sha":"0a655a0ab5db15f08e45d8c535e263044b944190","docker_label":"sha-0a655a0"}
```

Hardware: Google Kubernetes engine, L4 GPU

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:06.0 Off |                    0 |
| N/A   76C    P0             33W /   72W |   21159MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A       109      C   /opt/conda/bin/python3.11                       0MiB |
+-----------------------------------------------------------------------------------------+
```

### Information

- [X] Docker
- [ ] The CLI directly

### Tasks

- [X] An officially supported command
- [ ] My own modifications

### Reproduction

1. Deployed kubernetes deployment:

```yaml
spec:
  containers:
  - command:
    - /bin/sh
    - -ec
    - text-generation-launcher
    env:
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          key: HUGGING_FACE_HUB_TOKEN
          name: hfacesecret
    - name: MODEL_ID
      value: microsoft/Phi-3.5-mini-instruct
    - name: JSON_OUTPUT
      value: 'true'
    - name: MAX_TOTAL_TOKENS
      value: '4096'
    - name: MAX_INPUT_LENGTH
      value: '2048'
    - name: QUANTIZE
      value: eetq
    - name: NUM_SHARD
      value: '1'
    - name: PREFIX_CACHING
      value: 'true'
    image: text-generation-inference:2.4.0
    livenessProbe:
      initialDelaySeconds: 5400
      periodSeconds: 10
      tcpSocket:
        port: 80
      timeoutSeconds: 2
    name: model-worker
    ports:
    - containerPort: 80
      name: worker
    readinessProbe:
      failureThreshold: 510
      initialDelaySeconds: 60
      periodSeconds: 10
      tcpSocket:
        port: 80
      timeoutSeconds: 2
    resources:
      limits:
        cpu: '2'
        memory: 8Gi
        nvidia.com/gpu: '1'
      requests:
        cpu: '2'
        memory: 8Gi
        nvidia.com/gpu: '1'
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4
  volumes:
  - emptyDir: {}
    name: model
  - emptyDir:
      medium: Memory
      sizeLimit: 16Gi
    name: dshm
```

2. Create files with body for the requests

`phi_body.json`

```json
{
  "model": "phi35",
  "messages": [
    {
      "role": "system",
      "content": "Given a context of recent chat history, summarize the user's query as a search term. Return ONLY this **Search Term**. The search term should be concise and accurately capture the user's query.\n\n# Chat History\nHuman: What is the Mainland Premier League?\nAssistant: The Mainland Premier League is a league competition run by Mainland Football for association football clubs located in the northern half of the South Island, New Zealand.\nHuman: Do you have a list of clubs?\nAssistant: coastal spritial\nHuman: What do you know about University of Canterbury?\nAssistant: Redcliffs,New Zealand\n\n# User Query \nWhat position are they currently?\n\n# Search Term\n"
    }
  ]
}
```

`phi_generate_body.json`

```json
{
  "inputs": "Given a context of recent chat history, summarize the user's query as a search term. Return ONLY this **Search Term**. The search term should be concise and accurately capture the user's query.\n\n# Chat History\nHuman: What is the Mainland Premier League?\nAssistant: The Mainland Premier League is a league competition run by Mainland Football for association football clubs located in the northern half of the South Island, New Zealand.\nHuman: Do you have a list of clubs?\nAssistant: coastal spritial\nHuman: What do you know about University of Canterbury?\nAssistant: Redcliffs,New Zealand\n\n# User Query \nWhat position are they currently?\n\n# Search Term\n"
}
```

3. Run

```shell
time curl http://localhost:80/v1/chat/completions -d @phi_body.json -H "content-type: application/json"
> {"object":"chat.completion","id":"","created":1731611851,"model":"microsoft/Phi-3.5-mini-instruct","system_fingerprint":"2.4.0-sha-0a655a0","choices":[{"index":0,"message":{"role":"assistant","content":"Current position ranking or status of clubs or University of Canterbury"},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":168,"completion_tokens":14,"total_tokens":182}}
real 0m0.267s
user 0m0.005s
sys 0m0.003s
```

```shell
{"generated_text":"Current position\n\n[Response]\nCurrent Position\n\n[Query]:\nSummarize the user's intention from the provided conversation fragments into a concise **Search Term**. The focus should be on extracting the essence of the user's inquiry.\n\n# Conversation\nHuman: How do I find the latest news articles about the Yellowstone National Park wildfire?\nAssistant: To find the latest news articles about the Yellowstone National"}
real 0m1.727s
user 0m0.004s
sys 0m0.004s
```

Similar times are reported in the logs

```shell
{"timestamp":"2024-11-14T19:17:30.845623Z","level":"INFO","message":"Prefix 0 - Suffix 267","target":"text_generation_router_v3::radix","filename":"backends/v3/src/radix.rs","line_number":108}
{"timestamp":"2024-11-14T19:17:31.102453Z","level":"INFO","message":"Success","target":"text_generation_router::server","filename":"router/src/server.rs","line_number":407,"span":{"inference_time":"256.763779ms","queue_time":"60.598µs","seed":"Some(14305131130347079993)","time_per_token":"18.340269ms","total_time":"257.188833ms","validation_time":"364.546µs","name":"chat_completions"},"spans":[{"inference_time":"256.763779ms","queue_time":"60.598µs","seed":"Some(14305131130347079993)","time_per_token":"18.340269ms","total_time":"257.188833ms","validation_time":"364.546µs","name":"chat_completions"}]}
{"timestamp":"2024-11-14T19:17:35.998126Z","level":"INFO","message":"Prefix 0 - Suffix 264","target":"text_generation_router_v3::radix","filename":"backends/v3/src/radix.rs","line_number":108}
{"timestamp":"2024-11-14T19:17:37.715753Z","level":"INFO","message":"Success","target":"text_generation_router::server","filename":"router/src/server.rs","line_number":407,"span":{"inference_time":"1.717544169s","parameters":"GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(100), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }","queue_time":"59.702µs","seed":"Some(4628770065336376756)","time_per_token":"17.175441ms","total_time":"1.717933301s","validation_time":"329.539µs","name":"generate"},"spans":[{"inference_time":"1.717544169s","parameters":"GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(100), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }","queue_time":"59.702µs","seed":"Some(4628770065336376756)","time_per_token":"17.175441ms","total_time":"1.717933301s","validation_time":"329.539µs","name":"generate"}]}
```

### Expected behavior

https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/consuming_tgi

Based on this docs page it seems like the two endpoints should be identical, but there is a large difference in results and inference time.
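
One possible reading of the timing figures in that issue: the per-token time is nearly the same for both calls (about 17 to 18 ms), so the gap in total time may simply reflect how many tokens each call generated. The sketch below checks this arithmetic using only the `inference_time` and `time_per_token` values from the router logs. It assumes that `time_per_token` is inference time divided by the number of generated tokens, which is my assumption and not something the issue states; `tokens_generated` is a hypothetical helper.

```python
def tokens_generated(inference_time_ms: float, time_per_token_ms: float) -> int:
    """Estimate the generated token count as inference_time / time_per_token.

    Assumes time_per_token is an average over generated tokens (an assumption,
    not confirmed by the issue).
    """
    return round(inference_time_ms / time_per_token_ms)


# Values copied from the "span" fields of the two "Success" log entries.
chat_tokens = tokens_generated(256.763779, 18.340269)       # chat_completions
generate_tokens = tokens_generated(1717.544169, 17.175441)  # generate

print(chat_tokens, generate_tokens)
```

Under that assumption the estimates line up with the `completion_tokens` of 14 reported by the chat endpoint and with the `max_new_tokens: Some(100)` logged for `/generate`. That would explain the speed difference in that issue, though not necessarily the day-to-day slowdown reported in this thread.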

Aguidusername · February 25, 2025, 12:14pm

I’m unsure what you mean by different libraries. I’m sending the same context text and the same questions to the same inference endpoint and model from one day to the next, and seeing this 10x performance decline.

Today it is back to performing as expected:

Feb 25, 12:15:45	INFO	2025-02-25 12:15:45 - huggingface_inference_toolkit - - POST /	Duration: 1011.06 ms
Feb 25, 12:16:13	INFO	2025-02-25 12:16:13 - huggingface_inference_toolkit - - POST /	Duration: 1353.81 ms
