Hello everyone! I am new to LLM serving. I am using the hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 model, but I can switch to a similar one if necessary. Could you tell me what state-of-the-art technologies exist today that would give me the fastest inference, given that I am deploying the model on a server where it has to respond to several users at once, and quickly? I am running it on a single A100 80 GB. Currently I launch it with vLLM using these parameters:
CUDA_VISIBLE_DEVICES=0 vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  --rope-scaling='{"type": "dynamic", "factor": 8.0, "low_freq_factor": 1.0, "high_freq_factor": 4.0, "original_max_position_embeddings": 8192}'
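For context, clients talk to the server through the OpenAI-compatible API that vLLM exposes. A typical request looks roughly like this (the prompt and max_tokens here are just placeholders, not my real workload):

# Example request to vLLM's OpenAI-compatible chat endpoint on port 8080
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128
      }'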
With this setup, at a load of about 1 request per second, each request takes roughly 2 seconds to generate 128 new tokens (around 64 tokens/s per request).
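In case the measurement method matters, this is roughly how I time a single request against the same endpoint; it is only a quick check, not a proper benchmark:

# Rough latency check: time one 128-token completion
# (%{time_total} covers prefill plus all 128 decode steps)
curl -s -o /dev/null -w "total: %{time_total}s\n" \
  http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
        "prompt": "Hello!",
        "max_tokens": 128
      }'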
Maybe there are better approaches? Maybe I am launching vLLM incorrectly? Maybe it is possible to compile the model with some other toolchain, but I do not know the right tools for that? I would be very grateful for any help.