Deploying an LLM on a Space with a cost-efficient GPU

Hi guys,
I mainly use Hugging Face to build AI agents with Streamlit and LangChain and deploy them on Spaces.
Today I'd like to use a Space to deploy an open model (Mixtral-8x7B-Instruct-v0.1) for a client.
The goal is to make it accessible through an API (FastAPI), the way other providers (OpenAI, Mistral, …) do.
That part is no problem, but I'll need to add GPU hardware by upgrading the Space.
How do I choose the right settings so I don't end up with an expensive invoice?
Is it better to use dedicated hardware (more expensive, but a safer bet with no latency) or a ZeroGPU subscription (cheaper, but with some latency)?
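Here is roughly what I have in mind for the serving part, as an untested sketch (the `/generate` endpoint name and its parameters are just placeholders I made up):

```python
# app.py - minimal FastAPI wrapper around Mixtral, meant for a GPU Space (Docker SDK)
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # fp16 still needs ~90 GB of VRAM for this model
    device_map="auto",          # spread layers across whatever GPUs are available
)

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: GenerateRequest):
    messages = [{"role": "user", "content": req.prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=req.max_new_tokens)
    # Decode only the newly generated tokens, not the prompt
    completion = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    return {"text": completion}
```

I'd run it with uvicorn on port 7860, which I believe is what Docker Spaces expect by default.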

Thanks

There’s a daily usage limit, but ZeroGPU is overwhelmingly cost-effective and offers a flat rate instead of pay-as-you-go. The hardware is basically an NVIDIA H200, and you can use over 70 GB of VRAM… the Spaces themselves also have good CPU and RAM.

However, the implementation is quite quirky, so I'd recommend ZeroGPU only if you've looked at someone else's code and think you can manage it. The main quirk is sketched below.
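Concretely: you load the model at startup as usual, but anything that touches CUDA has to run inside a function decorated with `@spaces.GPU`, because the GPU is only attached for the duration of that call. A rough sketch (untested, and note that ZeroGPU works with the Gradio SDK, not a plain FastAPI app, as far as I know):

```python
import spaces  # pre-installed on ZeroGPU Spaces; import it before any CUDA use
import torch
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Moving the model to "cuda" at startup is fine here:
# ZeroGPU intercepts it and attaches the real GPU lazily
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

@spaces.GPU(duration=120)  # the GPU is only held while this function runs
def generate(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)

gr.Interface(fn=generate, inputs="text", outputs="text").launch()
```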

For the other regular PAYG GPUs, you can save money by quantizing your model and using a cheaper GPU with less VRAM (see the sketch below)…
There's probably no usage cap on those, so if you don't sleep or pause the Space frequently, it will cost you…
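For example, 4-bit quantization with bitsandbytes can shrink Mixtral enough to fit on a single mid-range GPU. A sketch (untested; the VRAM figures are rough estimates):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# NF4 4-bit quantization: roughly 25-30 GB of VRAM instead of ~90 GB in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,  # quantized on the fly at load time
    device_map="auto",
)
```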