How to deploy large-model inference across multiple machines with multiple GPUs?

leonard0 · August 25, 2023, 2:59am

I want to use llama2-70b-hf for inference. The model is about 133 GB in total. I have 4 machines, each with 4 GPUs, and each GPU has 16 GB of memory. The 4 machines are connected by InfiniBand (IB).
The question is: how do I deploy this model?

enochlev · December 19, 2023, 7:03pm

You need an LLM engineer for this.

You won't be able to load 70b, especially if all 4 machines are separate…
I would try a library such as LMDeploy (GitHub - InternLM/lmdeploy), a toolkit for compressing, deploying, and serving LLMs, which is the fastest one I have seen, or FastChat (GitHub - lm-sys/FastChat), an open platform for training, serving, and evaluating large language models, which may be easier to deploy across separate machines at once: you spin up a worker on every machine.
Or, with torchserve, make sure you quantize the models.

I would suggest llama-13b or llama-7b quantized to 8-bit. Keep in mind that you need about 2 GB of VRAM for every parallel request.
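
To make the numbers above concrete, here is a rough back-of-the-envelope sketch in Python. It assumes weights take roughly (parameter count × bits per parameter / 8) bytes, uses the ~133 GB figure from the question for the 70b model, and uses the ~2 GB per parallel request rule of thumb from this reply. The function names are made up for illustration, and the estimate ignores CUDA context, activations, and framework overhead, so treat the results as optimistic upper bounds.

```python
# Rough VRAM budgeting for the hardware described in the question.
# All helper names here are hypothetical; the figures are estimates only.

GPU_GB = 16            # memory per GPU card (from the question)
GPUS_PER_MACHINE = 4   # GPUs per machine (from the question)
MACHINES = 4           # number of machines (from the question)


def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9


def max_parallel_requests(gpu_gb: float, weights_gb: float,
                          per_request_gb: float = 2.0) -> int:
    """Requests that fit in the VRAM left after loading the weights,
    assuming ~2 GB per request (the rule of thumb quoted above)."""
    free_gb = gpu_gb - weights_gb
    return max(0, int(free_gb // per_request_gb))


machine_gb = GPU_GB * GPUS_PER_MACHINE   # VRAM in one machine
cluster_gb = machine_gb * MACHINES       # VRAM across all machines

# The 70b model (~133 GB per the question) exceeds one machine's VRAM,
# so it would have to be sharded across machines or quantized.
llama70b_gb = 133
print(f"one machine: {machine_gb} GB, cluster: {cluster_gb} GB, "
      f"70b weights: ~{llama70b_gb} GB")

# Smaller models quantized to 8-bit, one copy per 16 GB GPU.
for name, params_b in [("llama-13b", 13), ("llama-7b", 7)]:
    w = weight_gb(params_b, bits_per_param=8)
    n = max_parallel_requests(GPU_GB, w)
    print(f"{name} int8: ~{w:.0f} GB weights, "
          f"~{n} parallel request(s) per GPU")
```

Under these assumptions, an 8-bit 13b model leaves very little headroom on a 16 GB card, while an 8-bit 7b model leaves room for a few concurrent requests per GPU.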
