Best way to deploy an SLM/LLM model: best library and approach?

I want to deploy an SLM/LLM model from Hugging Face with the lowest possible response time.
Please help me with the best library and approach for inference. A template for production-level inference would also be helpful.

Library: Transformers with pipeline (rough sketch below)
Current inference time: ~4-5 sec for 2,000 tokens
Model to deploy: Phi-3 mini instruct, ~7.5 GB

System Configuration:
Nvidia 24 GB GPU
400 GB RAM
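
For context, the current setup is roughly along these lines (a simplified sketch rather than the exact code; the model ID and generation arguments are assumptions):

```python
# Rough sketch of a Transformers text-generation pipeline setup like the one described above.
# The model ID and generation arguments are assumptions for illustration.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",  # assumed checkpoint
    torch_dtype=torch.bfloat16,                # half precision fits the ~7.5 GB model on a 24 GB GPU
    device_map="auto",
)

prompt = "Extract the key facts from the following text:\n..."  # ~2,000-token input in practice
result = generator(prompt, max_new_tokens=256, do_sample=False)
print(result[0]["generated_text"])
```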

If you’re only doing inference, llama.cpp is probably more memory-efficient and faster. As for accuracy, if you use 4-bit quantization with GGUF or bitsandbytes, you’re unlikely to run into problems.
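
For the bitsandbytes route, a minimal 4-bit (NF4) loading sketch in Transformers looks roughly like this (the settings below are common defaults, not tuned recommendations):

```python
# Minimal 4-bit (NF4) loading sketch with bitsandbytes via Transformers.
# Settings are common defaults, not tuned recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Summarize the facts in this text: ...", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```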

I’ve heard that ExLlamaV2 is a good option if you’re chasing speed within the Transformers ecosystem.

Hi John, thank you for responding with this information.
However, I have already tried the llama-cpp-python library for inference with a GGUF-format model.

With the GPU, the response time was ~2 sec for the 2.3 GB Phi-3 mini 4k model.
But the accuracy was terrible, as I need to extract facts from ~2,000 tokens of text.

And when I switched to the latest 9 GB Phi-4 GGUF model, it was accurate, but it took ~4 sec to respond.
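
For reference, the llama-cpp-python setup was roughly along these lines (simplified; the model path and parameters below are placeholders rather than my exact settings):

```python
# Rough llama-cpp-python sketch with full GPU offload.
# The model path, context size, and sampling settings are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./phi-3-mini-4k-instruct-q4.gguf",  # placeholder GGUF file
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=4096,        # must cover the ~2,000-token input plus the output
)

out = llm(
    "Extract the key facts from the following text:\n...",
    max_tokens=256,
    temperature=0.0,
)
print(out["choices"][0]["text"])
```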

Maybe Phi-3 mini is too small to quantize…
LLMs of 12B or less vary a lot in performance depending on size, and anything under 8B is already borderline as a general-purpose model, so quantizing the mini, which is smaller still, may be difficult in terms of quality.
GPTQ and EXL2 take a long time to quantize, but they have the advantage of being accurate and fast at inference, so they might be worth a try.
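
For example, loading an already-quantized GPTQ checkpoint through Transformers is roughly this (the repo name is a placeholder, and it needs the GPTQ backend installed):

```python
# Rough sketch: loading a pre-quantized GPTQ checkpoint with Transformers.
# Requires a GPTQ backend (e.g. optimum + auto-gptq); the repo name below is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "some-org/Phi-3-mini-4k-instruct-GPTQ"  # placeholder, not a real repo
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")
```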

That said, quantization is usually a way to reduce memory footprint rather than to speed things up, so I think there are also ways to gain speed without quantizing.

If you want to increase speed, I think Flash Attention is also well known.
https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/
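
In Transformers, enabling it is roughly a flag at load time (assuming the flash-attn package is installed and the GPU supports it):

```python
# Rough sketch: enabling FlashAttention-2 in Transformers.
# Assumes the flash-attn package is installed and the GPU architecture supports it.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,               # FlashAttention-2 needs fp16/bf16
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```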

Also, TorchAO is the official PyTorch quantization library. Besides being fast in itself, it has the advantage of combining easily with PyTorch’s other speed-up techniques (such as torch.compile), which helps if you’re chasing speed; the trade-off is that its PyTorch version requirements are strict. It would be great if it works well for you, but I think there are still quite a few rough edges in practical use.
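
A rough sketch of that combination through Transformers’ TorchAO integration might look like this (it needs a recent transformers/torchao/PyTorch stack, and the exact config names vary between torchao releases):

```python
# Rough sketch: TorchAO int4 weight-only quantization via Transformers, plus torch.compile.
# Needs recent transformers/torchao/PyTorch; exact config names vary between torchao releases.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

quant_config = TorchAoConfig("int4_weight_only", group_size=128)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
    device_map="auto",
)

# torch.compile provides much of the speed-up; the first call is slow due to compilation.
model.forward = torch.compile(model.forward)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```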

Hey John, thank you for these extra tips for optimization.
I will try these if possible.

What are your views on the vLLM library? Have you tried it?
Also, is there any production-ready code template or Docker image I could start from? If there is, I would be really grateful.
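
For context, the kind of usage I have in mind is something like vLLM’s offline-inference API (a rough sketch based on its documentation; the model ID and sampling parameters are placeholders):

```python
# Rough sketch of vLLM offline inference (based on its documented API).
# Model ID and sampling parameters are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-3-mini-4k-instruct")  # placeholder model ID
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["Extract the key facts from the following text:\n..."], params)
print(outputs[0].outputs[0].text)
```
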
Thanks!
