Local LLM not working; it just produces gibberish. Looking for help as a beginner

Hi all,
I'm new to trying to run an LLM locally and I'm struggling quite a bit at the moment. What I want to do is have an LLM produce SQL; however, I'm having issues with the LLM producing gibberish output. Since I'm new to the field there are a lot of unknown unknowns, so any help is greatly appreciated.

Context:

I'm running two containers: one is the chat UI, the other runs Text Generation Inference (TGI) from Hugging Face.
Input to chat-ui:
“How many tables are there?”

Desired output:
The SQL statement that would produce the answer to the question.
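
For example, for the question above against the Chinook SQLite database used further down, I would expect something along the lines of the statement in this sketch (the exact SQL is just my illustration, not output from the model):

# Illustration only: the kind of SQL I'd want the LLM to produce for the
# question above, checked against the Chinook SQLite file used below.
import sqlite3

conn = sqlite3.connect("Chinook.db")
count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table';"
).fetchone()[0]
print(count)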

Problem:
The LLM outputs text that has nothing to do with the input or with the DB context that is passed to it. Regardless of the prompt (either with DB context or simply “Be a helpful assistant”), irrelevant text is generated.

Set-Up:
Model: microsoft/Phi-3-mini-4k-instruct

I'm just going to share my docker-compose.yaml file for simplicity:

services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    container_name: tgi
    ports:
      - 8080:80

    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGINGFACEHUB_API_TOKEN}
      - RUST_LOG=trace
      - NUM_SHARD=1
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    # need this to access GPU
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: 
      - '--huggingface-hub-cache' 
      - '/root/.cache/huggingface'
      - '--model-id'
      - '${MODEL_ID}' 
      - '--max-batch-prefill-tokens'
      - '${MAX_BATCH_PREFILL_TOKENS}' 
      - '--quantize'
      - 'bitsandbytes-nf4'
      - '--max-total-tokens'
      - '1024'
      - '--max-input-length'
      - '500'
    shm_size: 1gb
    volumes:
      - huggingface_cache:/root/.cache/huggingface # local volume to store the huggingface cache
  ui:
    image: localllm-ui:latest
    container_name: ui
    build:
      context: ./chat_ui/
    ports:
      - 7000:7000
  # index:
  #   image:
  #   ports:
  #     - 8088:80

  # api:
  #   image:
volumes:
  huggingface_cache: {}
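
For reference, this is the kind of direct request the TGI container should answer on the mapped port, bypassing the UI container entirely (a minimal sketch; /generate is TGI's standard REST route, and 8080 comes from the port mapping above):

# Minimal sketch: query the TGI container straight from the host, bypassing
# the gradio UI, using the 8080:80 port mapping from the compose file above.
import requests

payload = {
    "inputs": "How many tables are there?",
    "parameters": {"max_new_tokens": 50},
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=120)
print(resp.json())  # expected shape: {"generated_text": "..."}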

The following is my gradio app. I'm sharing this as well since some parameters are passed to the TGI from here too. I'm so new to this that I don't know where potential problems could even be, so I'm including it just in case.

# Imports shown for completeness; these are the packages this snippet relies on.
import gradio as gr
from langchain_community.utilities import SQLDatabase
from langchain_core.runnables import RunnableLambda
from langchain_huggingface import HuggingFaceEndpoint

db = SQLDatabase.from_uri("sqlite:///./Chinook.db")



def call_llm():

    llm = HuggingFaceEndpoint(
        #inference_server_url="http://tgi",
        endpoint_url = "http://tgi",
        max_new_tokens=50, #small number to keep responses short and more chat-like
        top_k=15,
        top_p=0.95,
        typical_p=0.95,
        temperature=0.1,
        repetition_penalty=2.0,
        streaming=True,
    )
    return llm   

# Initialize chat model
llm = call_llm()

simple_prompt = RunnableLambda(lambda _: "You are a helpful assistant.")

chain = (
    #RunnablePassthrough.assign(
    #    history=RunnableLambda(memory.load_memory_variables) | itemgetter("history")
    #)
    simple_prompt
    | llm
)
def stream_response(input, history):
    if input is not None:
        partial_message = ""
        # ChatInterface struggles with rendering stream
        # make the call to the bot
        for response in chain.stream({
            "input" : input
        }):
            partial_message += response
            yield partial_message 

gr.ChatInterface(stream_response, analytics_enabled=False).queue(default_concurrency_limit=None).launch(debug=True, server_name='0.0.0.0', server_port=7000, share=False)
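
One check I have in mind is calling TGI with the same sampling parameters but without LangChain in between, to see whether the gibberish already shows up there (a sketch using huggingface_hub, run from the host, so the mapped port is used instead of http://tgi):

# Sketch: call the TGI server directly with the same sampling parameters
# that call_llm() passes through LangChain, to isolate where the problem is.
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080")

text = client.text_generation(
    "How many tables are there?",
    max_new_tokens=50,
    top_k=15,
    top_p=0.95,
    temperature=0.1,
    repetition_penalty=2.0,  # same value as in call_llm() above
)
print(text)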

CUDA is being used in the docker container. The following output is from nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   35C    P8              5W /   35W |    3721MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2866      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A   1257129      C   /usr/src/server/.venv/bin/python3            3708MiB |
+-----------------------------------------------------------------------------------------+

Output from htop just to see some more system specs in case that helps:

Here is the debug output when starting the Text Generation Inference Server:

docker-compose logs -f

tgi | 2025-02-16T19:15:18.756780Z  INFO text_generation_launcher: Args {
tgi |     model_id: "microsoft/Phi-3-mini-4k-instruct",
tgi |     revision: None,
tgi |     validation_workers: 2,
tgi |     sharded: None,
tgi |     num_shard: Some(
tgi |         1,
tgi |     ),
tgi |     quantize: Some(
tgi |         BitsandbytesNf4,
tgi |     ),
tgi |     speculate: None,
tgi |     dtype: None,
tgi |     kv_cache_dtype: None,
tgi |     trust_remote_code: false,
tgi |     max_concurrent_requests: 128,
tgi |     max_best_of: 2,
tgi |     max_stop_sequences: 4,
tgi |     max_top_n_tokens: 5,
tgi |     max_input_tokens: None,
tgi |     max_input_length: Some(
tgi |         500,
tgi |     ),
tgi |     max_total_tokens: Some(
tgi |         1024,
tgi |     ),
tgi |     waiting_served_ratio: 0.3,
tgi |     max_batch_prefill_tokens: Some(
tgi |         500,
tgi |     ),
tgi |     max_batch_total_tokens: None,
tgi |     max_waiting_tokens: 20,
tgi |     max_batch_size: None,
tgi |     cuda_graphs: None,
tgi |     hostname: "caf9f8f55083",
tgi |     port: 80,
tgi |     shard_uds_path: "/tmp/text-generation-server",
tgi |     master_addr: "localhost",
tgi |     master_port: 29500,
tgi |     huggingface_hub_cache: Some(
tgi |         "/root/.cache/huggingface",
tgi |     ),
tgi |     weights_cache_override: None,
tgi |     disable_custom_kernels: false,
tgi |     cuda_memory_fraction: 1.0,
tgi |     rope_scaling: None,
tgi |     rope_factor: None,
tgi |     json_output: false,
tgi |     otlp_endpoint: None,
tgi |     otlp_service_name: "text-generation-inference.router",
tgi |     cors_allow_origin: [],
tgi |     api_key: None,
tgi |     watermark_gamma: None,
tgi |     watermark_delta: None,
tgi |     ngrok: false,
tgi |     ngrok_authtoken: None,
tgi |     ngrok_edge: None,
tgi |     tokenizer_config_path: None,
tgi |     disable_grammar_support: false,
tgi |     env: false,
tgi |     max_client_batch_size: 4,
tgi |     lora_adapters: None,
tgi |     usage_stats: On,
tgi |     payload_limit: 2000000,
tgi |     enable_prefill_logprobs: false,
tgi | }

Some ideas as to why it might not work locally:

Too little VRAM on my graphics card

  • I chose the model because it is the biggest one that still fits (with quantization) onto my GPU. However, I don't know how much VRAM is needed when actually doing inference. The model is listed among the supported models for TGI on Hugging Face.

There is some mismatch between the model, the TGI, and how I'm supplying messages to the LLM.
Maybe the model expects input in a different format (see the sketch below).
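
On the input-format point, my understanding is that the prompt could be built with the model's own chat template rather than sent as raw text. A rough sketch of what I mean, using transformers (not something I have verified end to end):

# Sketch: render the user question with the tokenizer's chat template so the
# prompt matches the format microsoft/Phi-3-mini-4k-instruct was trained on.
# Assumes transformers is installed and can fetch the tokenizer from the Hub.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

messages = [{"role": "user", "content": "How many tables are there?"}]

# tokenize=False returns the formatted prompt string; add_generation_prompt=True
# appends the assistant turn so the model continues from there.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # this string would go into the request to TGI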

Any help or any pointers would be greatly appreciated. Thanks 🙂


I recommend Ollama for your first LLM experience. It’s easy to use, it’s fairly fast, it uses little VRAM, and it will run on a CPU alone.

Once you want to do something more complicated, you can move on to more advanced software.