Need help with deploying my model on Spaces

I’ve recently fine-tuned a Gemma-2b-it model for a project and need to host it somewhere for simple demonstrations. I successfully created a Gradio Space with this model and it’s running (Demo Space - a Hugging Face Space by bhashwarsengupta). But when I give it a prompt, it just keeps loading and eventually fails with a timeout error: huggingface_hub.errors.HfHubHTTPError: 504 Server Error: Gateway Timeout for url

Here is my Gradio code:

import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient("bhashwarsengupta/gemma-finance")

print("Model Status\n")
model_status = client.get_model_status() # Get the model status
print(model_status)

context = """<system_prompt_here>"""

def respond(
    message,
    history: list[dict[str, str]],
    system_message=context,
    max_tokens=150,
    temperature=0.7,
    top_p=0.95,
):

    print(f"Message: {message}")
    print(f"History:\n\n{history}")
    print(f"System prompt: {system_message}")
    print(f"Max Tokens: {max_tokens}")
    print(f"Temp: {temperature}")
    print(f"Top_p: {top_p}")

    messages = [{"role": "system", "content": system_message}]

    # With type="messages", Gradio passes history as a list of
    # {"role": ..., "content": ...} dicts, so it can be appended as-is
    # (the original tuple-based loop would fail on the second turn).
    messages.extend(history)

    messages.append({"role": "user", "content": message})

    print(f"Messages:\n\n{messages}")

    response = ""

    print("Generating response...")
    for chunk in client.chat_completion(  # "chunk" instead of "message", to avoid shadowing the user prompt
        messages,
        model="bhashwarsengupta/gemma-finance",
        max_tokens=max_tokens,
        stream=True,
        temperature=temperature,
        top_p=top_p,
    ):
        print(chunk)
        token = chunk.choices[0].delta.content

        if token:  # the final streamed chunk can carry an empty/None delta
            response += token
        yield response

demo = gr.ChatInterface(
    respond,
    type="messages",
    additional_inputs=[
        gr.Textbox(value=context, label="System message"),
        gr.Slider(minimum=1, maximum=2048, value=150, step=1, label="Max new tokens"),
        gr.Slider(minimum=0.1, maximum=4.0, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(
            minimum=0.1,
            maximum=1.0,
            value=0.95,
            step=0.05,
            label="Top-p (nucleus sampling)",
        ),
    ],
)


if __name__ == "__main__":
    demo.launch(
        show_error=True
    )

Upon further inspection, when I check the model status, it shows the model is not loaded but is in a "Loadable" state. This could be why the program gets stuck when client.chat_completion() is called.
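
In case it’s relevant, here is the status check in isolation, plus a variation I have been thinking of trying where the request is asked to wait for the model to load. The x-wait-for-model header and the long timeout are assumptions I took from the Inference API docs, not something I have verified:

import time
from huggingface_hub import InferenceClient

# Ask the serverless API to wait for the model to load instead of timing
# out quickly; x-wait-for-model and the 300s timeout are my assumptions
# from the Inference API docs.
client = InferenceClient(
    "bhashwarsengupta/gemma-finance",
    headers={"x-wait-for-model": "true"},
    timeout=300,
)

# Poll the status; as far as I can tell, polling alone does not trigger
# a load, only an actual inference request does.
for _ in range(30):
    status = client.get_model_status()
    print(status.state, status.loaded)  # e.g. "Loadable", False
    if status.loaded:
        break
    time.sleep(10)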

I’m new to this and not sure how to proceed. Any insights would be greatly appreciated.

Here is my fine-tuned model repo: bhashwarsengupta/gemma-finance · Hugging Face

Regards,
Bhashwar Sengupta


Your program is not wrong. One part of the model settings did not match the HF specifications, so I have opened a PR.

However, the Serverless Inference API itself can currently hardly be used with personal models, so it may be difficult to make this work with InferenceClient. A model is hard to use unless its state is Warm, and only well-known models become Warm; a rough alternative is sketched below.
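
One alternative (just a sketch, not something I have run against your repo) is to load the fine-tuned weights directly inside the Space with transformers, which skips the serverless cold start entirely, assuming the Space hardware can hold a 2B model. Note that Gemma’s chat template may reject a "system" role, so the system prompt would need to be folded into the first user message:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "bhashwarsengupta/gemma-finance"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # use torch.float32 on CPU-only hardware
    device_map="auto",
)

def generate(messages, max_new_tokens=150, temperature=0.7, top_p=0.95):
    # Gemma's chat template may reject a "system" role, so merge any
    # system prompt into the first user message before calling this.
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

This generate function would then replace the client.chat_completion call inside respond.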
