I’ve recently fine-tuned a Gemma-2b-it model for a project and need to host it somewhere for simple demonstrations. I successfully created a Gradio Space with this model, and it’s running (Demo Space - a Hugging Face Space by bhashwarsengupta). But when I give it a prompt, it just keeps loading and eventually fails with a timeout error: huggingface_hub.errors.HfHubHTTPError: 504 Server Error: Gateway Timeout for url
Here is my Gradio code:
import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient("bhashwarsengupta/gemma-finance")

print("Model Status\n")
model_status = client.get_model_status()  # Check whether the model is loaded on the Inference API
print(model_status)

context = """<system_prompt_here>"""


def respond(
    message,
    history: list[dict[str, str]],
    system_message=context,
    max_tokens=150,
    temperature=0.7,
    top_p=0.95,
):
    print(f"Message: {message}")
    print(f"History:\n\n{history}")
    print(f"System prompt: {system_message}")
    print(f"Max tokens: {max_tokens}")
    print(f"Temp: {temperature}")
    print(f"Top_p: {top_p}")

    # With type="messages", Gradio already passes history as a list of
    # {"role": ..., "content": ...} dicts, so it can be appended directly.
    messages = [{"role": "system", "content": system_message}]
    messages.extend(history)
    messages.append({"role": "user", "content": message})
    print(f"Messages:\n\n{messages}")

    response = ""
    print("Generating response...")
    for chunk in client.chat_completion(  # renamed from `message` to avoid shadowing the parameter
        messages,
        model="bhashwarsengupta/gemma-finance",
        max_tokens=max_tokens,
        stream=True,
        temperature=temperature,
        top_p=top_p,
    ):
        print(chunk)
        token = chunk.choices[0].delta.content
        if token:  # the final streamed chunk can carry no content
            response += token
        yield response


demo = gr.ChatInterface(
    respond,
    type="messages",
    additional_inputs=[
        gr.Textbox(value=context, label="System message"),
        gr.Slider(minimum=1, maximum=2048, value=150, step=1, label="Max new tokens"),
        gr.Slider(minimum=0.1, maximum=4.0, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(
            minimum=0.1,
            maximum=1.0,
            value=0.95,
            step=0.05,
            label="Top-p (nucleus sampling)",
        ),
    ],
)

if __name__ == "__main__":
    demo.launch(show_error=True)
Upon further inspection, when I check the model status it shows the model is not loaded; it’s only in a "Loadable" state. I suspect this is why the program hangs when client.chat_completion() is called.
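From what I understand, "Loadable" means the model is cold and only gets loaded onto the serverless Inference API when the first request arrives, which for a 2B model may take longer than the default client timeout. As an experiment (I haven't verified this is the right fix), I'm considering raising the client-side timeout and asking the API to wait for the model instead of returning 504. A rough sketch of what I mean, assuming the serverless API honors the x-wait-for-model header:

from huggingface_hub import InferenceClient

# Experimental warm-up sketch: use a generous client timeout to survive a
# cold start, and (if I read the docs correctly) ask the serverless API to
# block until the model is loaded rather than failing with a 504.
client = InferenceClient(
    "bhashwarsengupta/gemma-finance",
    timeout=300,  # seconds; generous enough to cover model loading
    headers={"x-wait-for-model": "true"},
)

status = client.get_model_status()
print(status)  # e.g. ModelStatus(loaded=False, state='Loadable', ...)

if not status.loaded:
    # Tiny throwaway request just to trigger loading before real traffic.
    client.text_generation("Hello", max_new_tokens=1)
    print(client.get_model_status())

I'm not certain this addresses the root cause, though, or whether the model ever leaves the "Loadable" state on its own.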
I’m new to this and not sure how to proceed. Any insights would be greatly appreciated.
Here is my fine-tuned model repo: bhashwarsengupta/gemma-finance · Hugging Face
Regards,
Bhashwar Sengupta