How to deploy large-model inference on multiple machines with multiple GPUs?

You need an LLM engineer for this.

You won't be able to load 70B, especially if all 4 machines are separate…
I would try libraries such as [InternLM/lmdeploy](https://github.com/InternLM/lmdeploy), which is the fastest one I've seen, or [lm-sys/FastChat](https://github.com/lm-sys/FastChat), which may be easier to deploy across separate machines at once: you spin up a worker on every machine (see the sketch below).
Or, if you go with TorchServe, make sure you quantize the models.
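
As a rough illustration of the FastChat route, here is a minimal Python sketch of launching one controller plus a worker per machine. The hostnames, ports, and model path are placeholder assumptions; the `fastchat.serve` flags follow FastChat's documented CLI, but check the README for your installed version.

```python
# Sketch: launch FastChat serving processes across machines.
# Hostnames, ports, and the model path are placeholders (assumptions).
import subprocess

CONTROLLER_HOST = "10.0.0.1"   # machine that runs the controller (placeholder)
CONTROLLER_PORT = 21001

def start_controller():
    # Run this on one machine only.
    subprocess.Popen([
        "python3", "-m", "fastchat.serve.controller",
        "--host", "0.0.0.0",
        "--port", str(CONTROLLER_PORT),
    ])

def start_worker(this_host: str, worker_port: int = 21002,
                 model_path: str = "lmsys/vicuna-13b-v1.5"):
    # Run this on every GPU machine; each worker registers with the controller.
    subprocess.Popen([
        "python3", "-m", "fastchat.serve.model_worker",
        "--model-path", model_path,
        "--controller-address", f"http://{CONTROLLER_HOST}:{CONTROLLER_PORT}",
        "--worker-address", f"http://{this_host}:{worker_port}",
        "--host", "0.0.0.0",
        "--port", str(worker_port),
    ])
```

Once workers are registered, you can put FastChat's OpenAI-compatible API server (`fastchat.serve.openai_api_server`) in front of the controller and send requests to any worker through one endpoint.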

I would suggest llama-13b or llama-7b quantized to 8-bit. Keep in mind that you need about 2 GB of VRAM for every parallel request.
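
If you go the 8-bit route with Hugging Face transformers, a minimal loading sketch looks like the following. It assumes `transformers`, `accelerate`, and `bitsandbytes` are installed, and the model id is a placeholder for whichever llama-13b checkpoint you actually have access to.

```python
# Sketch: load a llama-13b checkpoint in 8-bit so it fits on a single 24-40 GB GPU.
# The model id below is a placeholder (assumption).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-13b"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # let accelerate place layers on the available GPUs
    load_in_8bit=True,   # bitsandbytes int8 quantization
)

prompt = "Explain tensor parallelism in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```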