### System Info
Running docker image version 2.4.0 with eetq quantization
Model: microsoft/Phi-3.5-mini-instruct
```
{"model_id":"microsoft/Phi-3.5-mini-instruct","model_sha":"af0dfb8029e8a74545d0736d30cb6b58d2f0f3f0","model_pipeline_tag":"text-generation","max_concurrent_requests":128,"max_best_of":2,"max_stop_sequences":4,"max_input_tokens":2048,"max_total_tokens":4096,"validation_workers":2,"max_client_batch_size":4,"router":"text-generation-router","version":"2.4.0","sha":"0a655a0ab5db15f08e45d8c535e263044b944190","docker_label":"sha-0a655a0"}
```
Hardware: Google Kubernetes Engine (GKE), L4 GPU
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 Off | 00000000:00:06.0 Off | 0 |
| N/A 76C P0 33W / 72W | 21159MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 109 C /opt/conda/bin/python3.11 0MiB |
+-----------------------------------------------------------------------------------------+
```
### Information
- [X] Docker
- [ ] The CLI directly
### Tasks
- [X] An officially supported command
- [ ] My own modifications
### Reproduction
1. Deploy the following Kubernetes deployment:
```yaml
spec:
  containers:
  - command:
    - /bin/sh
    - -ec
    - text-generation-launcher
    env:
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          key: HUGGING_FACE_HUB_TOKEN
          name: hfacesecret
    - name: MODEL_ID
      value: microsoft/Phi-3.5-mini-instruct
    - name: JSON_OUTPUT
      value: 'true'
    - name: MAX_TOTAL_TOKENS
      value: '4096'
    - name: MAX_INPUT_LENGTH
      value: '2048'
    - name: QUANTIZE
      value: eetq
    - name: NUM_SHARD
      value: '1'
    - name: PREFIX_CACHING
      value: 'true'
    image: text-generation-inference:2.4.0
    livenessProbe:
      initialDelaySeconds: 5400
      periodSeconds: 10
      tcpSocket:
        port: 80
      timeoutSeconds: 2
    name: model-worker
    ports:
    - containerPort: 80
      name: worker
    readinessProbe:
      failureThreshold: 510
      initialDelaySeconds: 60
      periodSeconds: 10
      tcpSocket:
        port: 80
      timeoutSeconds: 2
    resources:
      limits:
        cpu: '2'
        memory: 8Gi
        nvidia.com/gpu: '1'
      requests:
        cpu: '2'
        memory: 8Gi
        nvidia.com/gpu: '1'
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-l4
  volumes:
  - emptyDir: {}
    name: model
  - emptyDir:
      medium: Memory
      sizeLimit: 16Gi
    name: dshm
```
2. Create files containing the request bodies:
`phi_body.json`
```json
{
"model": "phi35",
"messages": [
{
"role": "system",
"content": "Given a context of recent chat history, summarize the user's query as a search term. Return ONLY this **Search Term**. The search term should be concise and accurately capture the user's query.\n\n# Chat History\nHuman: What is the Mainland Premier League?\nAssistant: The Mainland Premier League is a league competition run by Mainland Football for association football clubs located in the northern half of the South Island, New Zealand.\nHuman: Do you have a list of clubs?\nAssistant: coastal spritial\nHuman: What do you know about University of Canterbury?\nAssistant: Redcliffs,New Zealand\n\n# User Query \nWhat position are they currently?\n\n# Search Term\n"
}
]
}
```
`phi_generate_body.json`
```json
{
"inputs": "Given a context of recent chat history, summarize the user's query as a search term. Return ONLY this **Search Term**. The search term should be concise and accurately capture the user's query.\n\n# Chat History\nHuman: What is the Mainland Premier League?\nAssistant: The Mainland Premier League is a league competition run by Mainland Football for association football clubs located in the northern half of the South Island, New Zealand.\nHuman: Do you have a list of clubs?\nAssistant: coastal spritial\nHuman: What do you know about University of Canterbury?\nAssistant: Redcliffs,New Zealand\n\n# User Query \nWhat position are they currently?\n\n# Search Term\n"
}
```
3. Run both requests and time them:
```shell
time curl http://localhost:80/v1/chat/completions -d @phi_body.json -H "content-type: application/json"
> {"object":"chat.completion","id":"","created":1731611851,"model":"microsoft/Phi-3.5-mini-instruct","system_fingerprint":"2.4.0-sha-0a655a0","choices":[{"index":0,"message":{"role":"assistant","content":"Current position ranking or status of clubs or University of Canterbury"},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":168,"completion_tokens":14,"total_tokens":182}}
real 0m0.267s
user 0m0.005s
sys 0m0.003s
```
```shell
time curl http://localhost:80/generate -d @phi_generate_body.json -H "content-type: application/json"
{"generated_text":"Current position\n\n[Response]\nCurrent Position\n\n[Query]:\nSummarize the user's intention from the provided conversation fragments into a concise **Search Term**. The focus should be on extracting the essence of the user's inquiry.\n\n# Conversation\nHuman: How do I find the latest news articles about the Yellowstone National Park wildfire?\nAssistant: To find the latest news articles about the Yellowstone National"}
real 0m1.727s
user 0m0.004s
sys 0m0.004s
```
Similar times are reported in the logs:
```shell
{"timestamp":"2024-11-14T19:17:30.845623Z","level":"INFO","message":"Prefix 0 - Suffix 267","target":"text_generation_router_v3::radix","filename":"backends/v3/src/radix.rs","line_number":108}
{"timestamp":"2024-11-14T19:17:31.102453Z","level":"INFO","message":"Success","target":"text_generation_router::server","filename":"router/src/server.rs","line_number":407,"span":{"inference_time":"256.763779ms","queue_time":"60.598µs","seed":"Some(14305131130347079993)","time_per_token":"18.340269ms","total_time":"257.188833ms","validation_time":"364.546µs","name":"chat_completions"},"spans":[{"inference_time":"256.763779ms","queue_time":"60.598µs","seed":"Some(14305131130347079993)","time_per_token":"18.340269ms","total_time":"257.188833ms","validation_time":"364.546µs","name":"chat_completions"}]}
{"timestamp":"2024-11-14T19:17:35.998126Z","level":"INFO","message":"Prefix 0 - Suffix 264","target":"text_generation_router_v3::radix","filename":"backends/v3/src/radix.rs","line_number":108}
{"timestamp":"2024-11-14T19:17:37.715753Z","level":"INFO","message":"Success","target":"text_generation_router::server","filename":"router/src/server.rs","line_number":407,"span":{"inference_time":"1.717544169s","parameters":"GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(100), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }","queue_time":"59.702µs","seed":"Some(4628770065336376756)","time_per_token":"17.175441ms","total_time":"1.717933301s","validation_time":"329.539µs","name":"generate"},"spans":[{"inference_time":"1.717544169s","parameters":"GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(100), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }","queue_time":"59.702µs","seed":"Some(4628770065336376756)","time_per_token":"17.175441ms","total_time":"1.717933301s","validation_time":"329.539µs","name":"generate"}]}
```
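As a rough cross-check of the logged timings, the number of generated tokens can be estimated by dividing `inference_time` by `time_per_token` for each request. This is only a sketch: it assumes `time_per_token` is computed as inference time divided by generated tokens, which the logs do not state explicitly, and `estimate_tokens` is a hypothetical helper name, not part of TGI.

```shell
# Estimate generated tokens as inference_time / time_per_token.
# Values are in milliseconds, copied from the two "Success" log lines.
# Assumption (not confirmed by the logs): time_per_token = inference_time / generated tokens.
estimate_tokens() {
  awk -v total="$1" -v per_token="$2" 'BEGIN { printf "%.0f\n", total / per_token }'
}

chat_tokens=$(estimate_tokens 256.763779 18.340269)
generate_tokens=$(estimate_tokens 1717.544169 17.175441)

echo "chat_completions: ~${chat_tokens} tokens"
echo "generate: ~${generate_tokens} tokens"
```

Under that assumption the estimates come out to 14 and 100 tokens, which matches `"completion_tokens":14` in the chat response and `max_new_tokens: Some(100)` in the generate log. The per-token times are close (about 18.3 ms versus 17.2 ms), so most of the wall-clock gap appears to come from the number of tokens generated.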
### Expected behavior
https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/consuming_tgi
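Based on this docs page, the two endpoints should behave identically, but both the generated text and the total inference time differ substantially: about 0.27 s for the chat completions request versus about 1.72 s for the `generate` request.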