I want to deploy an SLM/LLM model from Hugging Face with the lowest possible response time.
Please help me with the best library and approach for inference. A template for production-level inference would also be helpful.
Library: Transformers with pipeline
Current inference time: ~4-5 sec for 2000 tokens.
Model to deploy: Phi-3 mini instruct (~7.5 GB).
If you’re only doing inference, llama.cpp is probably more memory-efficient and faster. As for accuracy, if you use 4-bit quantization with GGUF or bitsandbytes, you’re unlikely to run into problems.
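For reference, here’s a minimal llama-cpp-python sketch; the GGUF file name is just a placeholder, so point it at whichever 4-bit Phi-3 mini GGUF you actually download:

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder -- use your own downloaded GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",  # hypothetical local file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```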
I’ve heard that ExLlamaV2 is good if you’re pursuing speed in the Transformers family.
Maybe Phi-3 mini is too small to quantize…
LLMs of 12B parameters or less vary a lot in quality depending on size, and anything under 8B is already borderline as a general-purpose model, so quantizing the mini variant, which is smaller still, might hurt quality too much.
GPTQ and EXL2 take a long time to quantize, but they have the advantage of being accurate and fast at inference time, so they might be worth trying.
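If you go that route, a pre-quantized checkpoint is the practical option since quantizing yourself takes a while. Here’s a rough sketch of loading a GPTQ export with Transformers; the repo id is a placeholder, and you need a GPTQ backend such as auto-gptq installed:

```python
# Sketch: loading an already-quantized GPTQ checkpoint with Transformers.
# The repo id below is hypothetical -- use whichever GPTQ export you trust.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/Phi-3-mini-4k-instruct-GPTQ"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain GPTQ in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```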
Well, quantization is usually a way to reduce size rather than to speed things up, so I think there are also ways to get faster without quantizing.
Also, TorchAO is the official PyTorch quantization library. Besides being fast on its own, it combines easily with PyTorch’s other speed-up techniques (such as torch.compile), which makes it attractive if you’re chasing latency. The catch is that its PyTorch version requirements are strict; it would be great if it works for you, but I think there are still rough edges in practical use.
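As a rough sketch of what the TorchAO + torch.compile route looks like through Transformers (recent torch/torchao/transformers versions assumed; the quantization settings are just examples):

```python
# Sketch of TorchAO weight-only quantization via Transformers, plus torch.compile.
# Exact minimum versions of torch/torchao/transformers are not pinned here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"
quant_config = TorchAoConfig("int4_weight_only", group_size=128)  # example settings

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)

# torch.compile provides most of the extra speed; the first call is slow (compilation).
model.forward = torch.compile(model.forward, mode="max-autotune")

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```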
Hey John, thank you for these extra optimization tips.
I will try them if possible.
What are your views on the vLLM library? Have you tried it?
Also, is there any production-ready code or a Docker image I could start from? If there is, I would be really grateful.
Thanks!
Hugging Face Transformers
If you’re working with pre-trained models like GPT, BERT, or T5, Hugging Face is your best friend. It’s user-friendly, well-documented, and has a huge community. Plus, their Inference API makes deployment a breeze.
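For example, a bare-bones pipeline setup might look like this (the generation settings are just illustrative defaults):

```python
# Minimal text-generation pipeline sketch with half-precision weights.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,  # half precision cuts memory and latency
    device_map="auto",
)

print(generator("Write a haiku about fast inference.", max_new_tokens=48)[0]["generated_text"])
```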
PyTorch
For those who prefer flexibility and control, PyTorch is a great choice. It’s widely used in research and production, especially for custom models.
TensorFlow and TensorFlow Serving
TensorFlow is another solid option, and TensorFlow Serving is specifically designed for deploying models in production environments.
FastAPI
If you’re building an API to serve your model, FastAPI is lightweight, fast, and easy to use. It’s perfect for creating RESTful endpoints.
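As a rough starting point, a minimal FastAPI wrapper around a Transformers pipeline could look like this (the endpoint name and request shape are assumptions, not a fixed template):

```python
# Sketch of a FastAPI app serving a Transformers model as a REST endpoint.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(req: GenerateRequest):
    result = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"text": result[0]["generated_text"]}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```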
ONNX Runtime
If you need to optimize your model for different platforms, ONNX Runtime is a great tool. It helps you convert models into a universal format for faster inference.
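A sketch of this route using Hugging Face Optimum’s ONNX Runtime integration, assuming the model architecture is supported by the exporter (export=True converts the model on the fly, which can take a while for a model this size):

```python
# Sketch using Optimum's ONNX Runtime backend (pip install optimum[onnxruntime]).
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)  # export to ONNX

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("ONNX Runtime is", max_new_tokens=32)[0]["generated_text"])
```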