How to deploy large-model inference on multiple machines with multiple GPUs?

I want to use llama2-70b-hf for inference. The total model is about 133GB. I have 4 machines, each with 4 GPU cards, and each GPU card has 16GB of memory. The 4 machines are connected by InfiniBand (IB).
The question is: how do I deploy this model?

You need an LLM engineer for this.

You won't be able to load 70b, especially if all 4 machines are separate…
I would try a library such as InternLM/lmdeploy (LMDeploy is a toolkit for compressing, deploying, and serving LLMs), which is the fastest one I have seen, or lm-sys/FastChat (an open platform for training, serving, and evaluating large language models), which may be easier to deploy across separate machines at once: you spin up a worker on every machine.
Or, with torchserve, make sure you quantize the models.
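
To put rough numbers on why the separate machines matter, here is a back-of-the-envelope check using only the figures from the question. It is a sketch, not a sizing tool: it counts weights only and ignores KV cache, activations, and framework overhead, so the real headroom is smaller than what it prints.

```python
# Figures taken from the question (weights only, no runtime overhead).
MODEL_GB = 133          # llama2-70b-hf checkpoint size
MACHINES = 4
GPUS_PER_MACHINE = 4
GB_PER_GPU = 16

per_machine_gb = GPUS_PER_MACHINE * GB_PER_GPU   # memory of one machine
cluster_gb = MACHINES * per_machine_gb           # memory of all machines

fits_one_gpu = MODEL_GB <= GB_PER_GPU
fits_one_machine = MODEL_GB <= per_machine_gb
fits_cluster = MODEL_GB <= cluster_gb

print(f"one GPU:     {GB_PER_GPU} GB, fits: {fits_one_gpu}")
print(f"one machine: {per_machine_gb} GB, fits: {fits_one_machine}")
print(f"all 16 GPUs: {cluster_gb} GB, fits: {fits_cluster}")
```

So the unquantized weights do not fit on any single machine. They only fit if the model is sharded across all 16 GPUs over the IB link, which is the hard part.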

I would suggest llama-13b or llama-7b quantized to 8-bit. Keep in mind that you need about 2GB of VRAM for every parallel request.
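
As a rough illustration of that rule of thumb, the sketch below estimates how many parallel requests fit on one 16GB card. The assumptions are mine, not measured: weights take about 1 byte per parameter at 8-bit, each request costs the 2GB mentioned above, and nothing else uses VRAM. The function names are made up for this example.

```python
def weights_gb(params_billion: float, bits: int) -> float:
    """Approximate weight size in GB: parameters * bits / 8 (assumption)."""
    return params_billion * bits / 8


def max_parallel_requests(gpu_gb: float, params_billion: float, bits: int,
                          gb_per_request: float = 2.0) -> int:
    """Requests that fit in the VRAM left over after loading the weights."""
    free_gb = gpu_gb - weights_gb(params_billion, bits)
    return max(0, int(free_gb // gb_per_request))


# One 16GB card, 8-bit quantization, ~2GB per parallel request.
for name, size_b in [("llama-7b", 7), ("llama-13b", 13)]:
    n = max_parallel_requests(16, size_b, 8)
    print(f"{name}: ~{weights_gb(size_b, 8):.0f} GB weights, "
          f"~{n} parallel request(s)")
```

Under these assumptions a 7b model leaves room for about 4 parallel requests per card and a 13b model for about 1, so the smaller model may be the better choice if you need concurrency.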