Hi all,
I'm new to running an LLM locally and I'm struggling quite a bit at the moment. What I want to do is have an LLM produce SQL, but I'm having issues with the model producing gibberish output. Since I'm new to the field there are a lot of unknown unknowns, so any help is greatly appreciated.
Context:
I'm running two containers: one is the chat UI, the other runs Text Generation Inference (TGI) from Hugging Face.
Input to chat-ui:
“How many tables are there?”
Desired output:
The SQL statement that would produce the answer to the question, e.g. for SQLite something like SELECT COUNT(*) FROM sqlite_master WHERE type='table';
Problem:
The LLM outputs text that has nothing to do with the input or with the DB context that is passed to it. Regardless of the prompt (whether it includes the DB context or is simply “Be a helpful assistant”), irrelevant text is generated.
Set-Up:
Model: microsoft/Phi-3-mini-4k-instruct
I'm just going to share my docker-compose.yaml file for simplicity:
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    container_name: tgi
    ports:
      - 8080:80
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGINGFACEHUB_API_TOKEN}
      - RUST_LOG=trace
      - NUM_SHARD=1
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    # need this to access the GPU
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command:
      - '--huggingface-hub-cache'
      - '/root/.cache/huggingface'
      - '--model-id'
      - '${MODEL_ID}'
      - '--max-batch-prefill-tokens'
      - '${MAX_BATCH_PREFILL_TOKENS}'
      - '--quantize'
      - 'bitsandbytes-nf4'
      - '--max-total-tokens'
      - '1024'
      - '--max-input-length'
      - '500'
    shm_size: 1gb
    volumes:
      - huggingface_cache:/root/.cache/huggingface # local directory to store the Hugging Face cache
  ui:
    image: localllm-ui:latest
    container_name: ui
    build:
      context: ./chat_ui/
    ports:
      - 7000:7000
  # index:
  #   image:
  #   ports:
  #     - 8088:80
  # api:
  #   image:
volumes:
  huggingface_cache: {}
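In case it helps narrow things down, here is a minimal sketch of how the TGI container can be queried directly over HTTP, bypassing the UI container entirely. It assumes TGI's /generate endpoint and that the container is reachable from the host at localhost:8080 (port 8080 is mapped to 80 above):

import requests

# Sanity check against TGI's /generate endpoint, bypassing the Gradio UI.
# Assumes host port 8080 is mapped to port 80 in the tgi container, as in the compose file.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "How many tables are there?",
        "parameters": {"max_new_tokens": 50},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])

If the text that comes back here is already unrelated to the prompt, the problem would be on the TGI/model side rather than in the Gradio app.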
The following is my Gradio app. I'm sharing this as well since some parameters are passed to the TGI from here. I'm so new to this that I don't know where the potential problems could even be, so I'm including it just in case.
# imports (paths assume recent langchain / langchain-huggingface packages)
import gradio as gr
from langchain_community.utilities import SQLDatabase
from langchain_core.runnables import RunnableLambda
from langchain_huggingface import HuggingFaceEndpoint

db = SQLDatabase.from_uri("sqlite:///./Chinook.db")

def call_llm():
    llm = HuggingFaceEndpoint(
        #inference_server_url="http://tgi",
        endpoint_url="http://tgi",
        max_new_tokens=50,  # small number to keep responses short and more chat-like
        top_k=15,
        top_p=0.95,
        typical_p=0.95,
        temperature=0.1,
        repetition_penalty=2.0,
        streaming=True,
    )
    return llm

# Initialize chat model
llm = call_llm()

# prompt is currently just a fixed string, independent of the chat input
simple_prompt = RunnableLambda(lambda _: "You are a helpful assistant.")

chain = (
    #RunnablePassthrough.assign(
    #    history=RunnableLambda(memory.load_memory_variables) | itemgetter("history")
    #)
    simple_prompt
    | llm
)

def stream_response(input, history):
    if input is not None:
        partial_message = ""
        # ChatInterface struggles with rendering the stream
        # make the call to the bot and yield the running message
        for response in chain.stream({
            "input": input
        }):
            partial_message += response
            yield partial_message

gr.ChatInterface(stream_response, analytics_enabled=False).queue(default_concurrency_limit=None).launch(debug=True, server_name='0.0.0.0', server_port=7000, share=False)
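One debugging step on the Python side (just a sketch, reusing call_llm from above) is to invoke the endpoint wrapper directly, with no chain or Gradio in between, to see whether the raw completion is already gibberish:

# Bypass the chain and the UI entirely and hit the endpoint wrapper directly.
llm = call_llm()
print(llm.invoke("Write a SQL query that counts the number of tables in a SQLite database."))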
CUDA is being used inside the Docker container. The following output is from nvidia-smi:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3050 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 35C P8 5W / 35W | 3721MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2866 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 1257129 C /usr/src/server/.venv/bin/python3 3708MiB |
+-----------------------------------------------------------------------------------------+
Output from htop, just to show some more system specs in case that helps:
Here is the debug output when starting the Text Generation Inference Server:
docker-compose logs -f
tgi | 2025-02-16T19:15:18.756780Z INFO text_generation_launcher: Args {
tgi | model_id: "microsoft/Phi-3-mini-4k-instruct",
tgi | revision: None,
tgi | validation_workers: 2,
tgi | sharded: None,
tgi | num_shard: Some(
tgi | 1,
tgi | ),
tgi | quantize: Some(
tgi | BitsandbytesNf4,
tgi | ),
tgi | speculate: None,
tgi | dtype: None,
tgi | kv_cache_dtype: None,
tgi | trust_remote_code: false,
tgi | max_concurrent_requests: 128,
tgi | max_best_of: 2,
tgi | max_stop_sequences: 4,
tgi | max_top_n_tokens: 5,
tgi | max_input_tokens: None,
tgi | max_input_length: Some(
tgi | 500,
tgi | ),
tgi | max_total_tokens: Some(
tgi | 1024,
tgi | ),
tgi | waiting_served_ratio: 0.3,
tgi | max_batch_prefill_tokens: Some(
tgi | 500,
tgi | ),
tgi | max_batch_total_tokens: None,
tgi | max_waiting_tokens: 20,
tgi | max_batch_size: None,
tgi | cuda_graphs: None,
tgi | hostname: "caf9f8f55083",
tgi | port: 80,
tgi | shard_uds_path: "/tmp/text-generation-server",
tgi | master_addr: "localhost",
tgi | master_port: 29500,
tgi | huggingface_hub_cache: Some(
tgi | "/root/.cache/huggingface",
tgi | ),
tgi | weights_cache_override: None,
tgi | disable_custom_kernels: false,
tgi | cuda_memory_fraction: 1.0,
tgi | rope_scaling: None,
tgi | rope_factor: None,
tgi | json_output: false,
tgi | otlp_endpoint: None,
tgi | otlp_service_name: "text-generation-inference.router",
tgi | cors_allow_origin: ,
tgi | api_key: None,
tgi | watermark_gamma: None,
tgi | watermark_delta: None,
tgi | ngrok: false,
tgi | ngrok_authtoken: None,
tgi | ngrok_edge: None,
tgi | tokenizer_config_path: None,
tgi | disable_grammar_support: false,
tgi | env: false,
tgi | max_client_batch_size: 4,
tgi | lora_adapters: None,
tgi | usage_stats: On,
tgi | payload_limit: 2000000,
tgi | enable_prefill_logprobs: false,
tgi | }
Some ideas as to why it might not work locally:
- Too little VRAM on my graphics card.
  - I chose this model because it is the biggest one that still fits (with quantization) on my GPU, and nvidia-smi above shows about 3.7 GiB of the 4 GiB in use once it is loaded. However, I don't know how much VRAM is actually needed during inference. The model is listed among the supported models for TGI on Hugging Face.
- Some mismatch between the model, TGI, and how I'm supplying messages to the LLM.
  - Maybe the model expects input in a different format (see the sketch below for what I mean).
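To illustrate that last point: my understanding (I have not verified this) is that the Phi-3 instruct models expect their own chat template rather than a plain text prompt, so a sketch of how the prompt could be built from the tokenizer's template would look roughly like this (assuming transformers' apply_chat_template works for this model):

from transformers import AutoTokenizer

# Load the tokenizer for the same model TGI is serving so we can reuse its chat template.
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

messages = [
    {"role": "user", "content": "How many tables are there?"},
]

# Render the messages into the raw prompt string the model was trained on,
# ending with the assistant turn so the model knows to answer.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

The resulting string could then be sent as the inputs field of the /generate test above, which should at least show whether the gibberish is a prompt-format issue.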
Any help or pointers would be greatly appreciated. Thanks!