I want to deploy an SLM/LLM model from Hugging Face with the lowest possible response time.
Please help me with the best library and approach for inference. A template for production-level inference would also be helpful.
Library: Transformers with pipeline
Current inference time: ~4-5 sec for 2000 tokens.
Model to deploy: Phi-3 mini instruct (~7.5 GB).
If you’re only doing inference, llama.cpp is probably more memory-efficient and faster. As for accuracy, if you use 4-bit quantization with GGUF or bitsandbytes, you’re unlikely to run into problems.
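For reference, here’s a minimal llama-cpp-python sketch; the GGUF file name is just a placeholder, so point it at whichever 4-bit Phi-3 mini GGUF you actually download:

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder -- use your own downloaded GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",  # hypothetical local file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```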
I’ve heard that ExLlamaV2 is good if you’re pursuing speed in the Transformers family.
Maybe Phi-3 mini is too small to quantize…
LLMs of 12B parameters or less vary a lot in quality depending on size, and anything under 8B is already borderline as a general-purpose model, so quantizing the mini variant, which is smaller still, might hurt quality too much.
GPTQ and EXL2 take a long time to quantize, but they have the advantage of being accurate and fast at inference time, so they might be worth trying.
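If you go that route, a pre-quantized checkpoint is the practical option since quantizing yourself takes a while. Here’s a rough sketch of loading a GPTQ export with Transformers; the repo id is a placeholder, and you need a GPTQ backend such as auto-gptq installed:

```python
# Sketch: loading an already-quantized GPTQ checkpoint with Transformers.
# The repo id below is hypothetical -- use whichever GPTQ export you trust.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/Phi-3-mini-4k-instruct-GPTQ"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain GPTQ in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```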
Well, quantization is usually a way to reduce size rather than to speed things up, so I think there are also ways to get faster without quantizing.
Also, TorchAO is the official PyTorch quantization library. Besides being fast on its own, it combines easily with PyTorch’s other speed-up techniques (such as torch.compile), which makes it attractive if you’re chasing latency. The catch is that its PyTorch version requirements are strict; it would be great if it works for you, but I think there are still rough edges in practical use.
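As a rough sketch of what the TorchAO + torch.compile route looks like through Transformers (recent torch/torchao/transformers versions assumed; the quantization settings are just examples):

```python
# Sketch of TorchAO weight-only quantization via Transformers, plus torch.compile.
# Exact minimum versions of torch/torchao/transformers are not pinned here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"
quant_config = TorchAoConfig("int4_weight_only", group_size=128)  # example settings

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)

# torch.compile provides most of the extra speed; the first call is slow (compilation).
model.forward = torch.compile(model.forward, mode="max-autotune")

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```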
Hey John, thank you for these extra optimization tips.
I will try them if possible.
What are your views on the vLLM library? Have you tried it?
Also, is there any production-ready code or a Docker image I could start from? If there is, I would be really grateful.
Thanks!
Hugging Face Transformers
If you’re working with pre-trained models like GPT, BERT, or T5, Hugging Face is your best friend. It’s user-friendly, well-documented, and has a huge community. Plus, their Inference API makes deployment a breeze.
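For example, a bare-bones pipeline setup might look like this (the generation settings are just illustrative defaults):

```python
# Minimal text-generation pipeline sketch with half-precision weights.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,  # half precision cuts memory and latency
    device_map="auto",
)

print(generator("Write a haiku about fast inference.", max_new_tokens=48)[0]["generated_text"])
```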
PyTorch
For those who prefer flexibility and control, PyTorch is a great choice. It’s widely used in research and production, especially for custom models.
TensorFlow and TensorFlow Serving
TensorFlow is another solid option, and TensorFlow Serving is specifically designed for deploying models in production environments.
FastAPI
If you’re building an API to serve your model, FastAPI is lightweight, fast, and easy to use. It’s perfect for creating RESTful endpoints.
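As a rough starting point, a minimal FastAPI wrapper around a Transformers pipeline could look like this (the endpoint name and request shape are assumptions, not a fixed template):

```python
# Sketch of a FastAPI app serving a Transformers model as a REST endpoint.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(req: GenerateRequest):
    result = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"text": result[0]["generated_text"]}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```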
ONNX Runtime
If you need to optimize your model for different platforms, ONNX Runtime is a great tool. It helps you convert models into a universal format for faster inference.
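A sketch of this route using Hugging Face Optimum’s ONNX Runtime integration, assuming the model architecture is supported by the exporter (export=True converts the model on the fly, which can take a while for a model this size):

```python
# Sketch using Optimum's ONNX Runtime backend (pip install optimum[onnxruntime]).
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)  # export to ONNX

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("ONNX Runtime is", max_new_tokens=32)[0]["generated_text"])
```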