I want to deploy an SLM/LLM model from Hugging Face with the lowest response time.
Please help me with the best library and approach for inference. A template for production-level inference would also be helpful.
Library: Transformers with pipeline
Current inference time: ~4-5 sec for 2000 tokens.
Model to deploy: Phi-3 mini instruct (~7.5 GB).
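For reference, my setup is roughly the following (a minimal sketch; the exact prompt and generation parameters are placeholders, not my production values):

```python
# Rough sketch of my current Transformers pipeline setup.
# Model variant (4k vs 128k) and generation settings are assumptions.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,  # unquantized weights, ~7.5 GB
    device_map="auto",
)

output = pipe(
    "Explain the difference between latency and throughput.",
    max_new_tokens=2000,
    do_sample=False,
)
print(output[0]["generated_text"])
```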
If you’re only doing inference, llama.cpp is probably more memory-efficient and faster. As for accuracy, if you use 4-bit quantization with GGUF or bitsandbytes, you’re unlikely to run into any problems.
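If you go that route, a minimal llama-cpp-python sketch looks roughly like this (the GGUF filename is a placeholder for whichever Q4 quant you download, and the settings are just starting points):

```python
# Sketch: 4-bit GGUF inference with llama-cpp-python.
# The model path is a placeholder; point it at your downloaded Phi-3 mini Q4 GGUF.
from llama_cpp import Llama

llm = Llama(
    model_path="./phi-3-mini-4k-instruct-q4_k_m.gguf",  # placeholder filename
    n_gpu_layers=-1,  # offload all layers to the GPU if VRAM allows
    n_ctx=4096,       # context window
)

result = llm(
    "Explain the difference between latency and throughput.",
    max_tokens=512,
)
print(result["choices"][0]["text"])
```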
I’ve heard that ExLlamaV2 is good if you’re chasing speed within the Transformers ecosystem.
Maybe Phi-3 mini is too small to quantize…
LLMs of 12B or less vary a lot in quality depending on size, and anything under 8B is already borderline as a general-purpose model, so quantizing a mini model even smaller than that may cost you noticeable quality.
GPTQ and EXL2 take a long time to quantize, but they have the advantage of being accurate and fast at inference, so they might still be worth trying.
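If you try GPTQ, a pre-quantized checkpoint can be loaded straight through Transformers; here is a rough sketch, assuming a GPTQ quant of Phi-3 mini is available (the repo ID is a placeholder, and you need optimum plus a GPTQ backend such as auto-gptq or gptqmodel installed):

```python
# Sketch: loading a pre-quantized GPTQ checkpoint with Transformers.
# The repo ID below is a placeholder for whichever GPTQ quant you pick.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "someuser/Phi-3-mini-4k-instruct-GPTQ"  # placeholder repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Summarize GPTQ in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```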
That said, quantization is usually a way to reduce memory footprint rather than to speed things up, so there may be ways to cut latency without quantizing at all.
Also, TorchAO is the official PyTorch quantization library. Besides being fast on its own, it combines easily with PyTorch’s speed-up techniques such as torch.compile, which makes it attractive if you’re chasing latency. Of course, its PyTorch version requirements are strict. It would be great if it works well for you, but I think there are still some rough edges in practical use.
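Roughly, the TorchAO path through Transformers looks like this (a sketch, not a tested recipe; the quant type, group size, and compile settings are just example choices, and recent torch/torchao/transformers versions are assumed):

```python
# Sketch: TorchAO int4 weight-only quantization via Transformers, plus torch.compile.
# Quant settings and compile mode are example choices; version requirements apply.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"
quant_config = TorchAoConfig("int4_weight_only", group_size=128)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)

# Optional: compile the forward pass; the first call pays the compilation cost.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("What is TorchAO?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```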
Hey John, thank you for these extra optimization tips.
I will try them if possible.
What are your views on the vLLM library? Have you tried it?
Also, is there any production-ready code or Docker image I could start from? If there is, I would be really grateful.
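For context, the kind of minimal vLLM usage I have seen in its docs is roughly the following (my own sketch; the model ID and settings are just guesses at a starting point, and I understand vLLM also ships an OpenAI-compatible server and a Docker image for actual serving):

```python
# Sketch: offline batch inference with vLLM's Python API.
# Settings here are placeholders; for serving, vLLM's OpenAI-compatible server is the usual route.
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-3-mini-4k-instruct", dtype="bfloat16", max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=512)

outputs = llm.generate(["Explain continuous batching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```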
Thanks!