Hello everyone! I am new to LLM serving. I am using the hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 model, but I can switch to a similar one if necessary. Could you tell me what state-of-the-art technologies exist today that would give me the fastest inference, given that I am deploying the model on a server where it has to respond to several users at once, and quickly? I am running it on a single A100 80 GB. Currently I launch it with vLLM using these parameters:
CUDA_VISIBLE_DEVICES=0 vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  --rope-scaling='{"type": "dynamic", "factor": 8.0, "low_freq_factor": 1.0, "high_freq_factor": 4.0, "original_max_position_embeddings": 8192}'
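For context, clients talk to the server through the OpenAI-compatible API that vLLM exposes. A typical request looks roughly like this (the prompt and max_tokens here are just placeholders, not my real workload):

# Example request to vLLM's OpenAI-compatible chat endpoint on port 8080
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128
      }'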
With this setup, at a load of about 1 request per second, each request takes roughly 2 seconds to generate 128 new tokens (around 64 tokens/s per request).
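In case the measurement method matters, this is roughly how I time a single request against the same endpoint; it is only a quick check, not a proper benchmark:

# Rough latency check: time one 128-token completion
# (%{time_total} covers prefill plus all 128 decode steps)
curl -s -o /dev/null -w "total: %{time_total}s\n" \
  http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
        "prompt": "Hello!",
        "max_tokens": 128
      }'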
Maybe there are better approaches? Maybe I am launching vLLM incorrectly? Maybe it is possible to compile the model with some other toolchain, but I do not know the right tools for that? I would be very grateful for any help.