Need help with deploying my model on Spaces

I’ve recently fine-tuned a Gemma-2b-it model for a project and need to host it somewhere for simple demonstrations. I successfully created a Gradio Space with this model and it’s running (Demo Space - a Hugging Face Space by bhashwarsengupta). But when I give it a prompt, it just keeps loading and eventually fails with a timeout error: huggingface_hub.errors.HfHubHTTPError: 504 Server Error: Gateway Timeout for url

Here is my Gradio code:

import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient("bhashwarsengupta/gemma-finance")

print("Model Status\n")
model_status = client.get_model_status() # Get the model status
print(model_status)

context = """<system_prompt_here>"""

def respond(
    message,
    history: list[dict[str, str]],
    system_message=context,
    max_tokens=150,
    temperature=0.7,
    top_p=0.95,
):

    print(f"Message: {message}")
    print(f"History:\n\n{history}")
    print(f"System prompt: {system_message}")
    print(f"Max Tokens: {max_tokens}")
    print(f"Temp: {temperature}")
    print(f"Top_p: {top_p}")

    messages = [{"role": "system", "content": system_message}]

    # With type="messages", Gradio passes history as a list of
    # {"role": ..., "content": ...} dicts, so it can be appended as-is
    # (the original tuple-based loop would fail on the second turn).
    messages.extend(history)

    messages.append({"role": "user", "content": message})

    print(f"Messages:\n\n{messages}")

    response = ""

    print("Generating response...")
    for chunk in client.chat_completion(  # "chunk" instead of "message", to avoid shadowing the user prompt
        messages,
        model="bhashwarsengupta/gemma-finance",
        max_tokens=max_tokens,
        stream=True,
        temperature=temperature,
        top_p=top_p,
    ):
        print(chunk)
        token = chunk.choices[0].delta.content

        if token:  # the final streamed chunk can carry an empty/None delta
            response += token
        yield response

demo = gr.ChatInterface(
    respond,
    type="messages",
    additional_inputs=[
        gr.Textbox(value=context, label="System message"),
        gr.Slider(minimum=1, maximum=2048, value=150, step=1, label="Max new tokens"),
        gr.Slider(minimum=0.1, maximum=4.0, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(
            minimum=0.1,
            maximum=1.0,
            value=0.95,
            step=0.05,
            label="Top-p (nucleus sampling)",
        ),
    ],
)


if __name__ == "__main__":
    demo.launch(
        show_error=True
    )

Upon further inspection, when I check the model status, it shows the model is not loaded but is in a "Loadable" state. This could be why the program gets stuck when client.chat_completion() is called.
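
In case it’s relevant, here is the status check in isolation, plus a variation I have been thinking of trying where the request is asked to wait for the model to load. The x-wait-for-model header and the long timeout are assumptions I took from the Inference API docs, not something I have verified:

import time
from huggingface_hub import InferenceClient

# Ask the serverless API to wait for the model to load instead of timing
# out quickly; x-wait-for-model and the 300s timeout are my assumptions
# from the Inference API docs.
client = InferenceClient(
    "bhashwarsengupta/gemma-finance",
    headers={"x-wait-for-model": "true"},
    timeout=300,
)

# Poll the status; as far as I can tell, polling alone does not trigger
# a load, only an actual inference request does.
for _ in range(30):
    status = client.get_model_status()
    print(status.state, status.loaded)  # e.g. "Loadable", False
    if status.loaded:
        break
    time.sleep(10)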

I’m new to this and not sure how to proceed. Any insights would be greatly appreciated.

Here is my fine-tuned model repo: bhashwarsengupta/gemma-finance · Hugging Face

Regards,
Bhashwar Sengupta


Your program is not wrong. One part of the model settings did not match the HF specifications, so I have opened a PR.

However, the Serverless Inference API itself can currently hardly be used with personal models, so it may be difficult to make this work with InferenceClient. A model is hard to use unless its state is Warm, and only well-known models become Warm; a rough alternative is sketched below.
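
One alternative (just a sketch, not something I have run against your repo) is to load the fine-tuned weights directly inside the Space with transformers, which skips the serverless cold start entirely, assuming the Space hardware can hold a 2B model. Note that Gemma’s chat template may reject a "system" role, so the system prompt would need to be folded into the first user message:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "bhashwarsengupta/gemma-finance"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # use torch.float32 on CPU-only hardware
    device_map="auto",
)

def generate(messages, max_new_tokens=150, temperature=0.7, top_p=0.95):
    # Gemma's chat template may reject a "system" role, so merge any
    # system prompt into the first user message before calling this.
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

This generate function would then replace the client.chat_completion call inside respond.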
