Multi-GPU inference

Hello, I am trying to maximize inference speed for a single prompt on a small (7B) model. I have a server with 4 GPUs. It seems possible to use accelerate to speed up inference. Does anyone have example code? I only see examples of splitting multiple prompts across GPUs, but I only have one prompt at a time.
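
For reference, this is roughly the kind of accelerate-style loading I mean, just a sketch with `device_map="auto"` (the model id is only a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder 7B model, substitute your own

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate shard the model's layers across the available GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Single prompt, greedy generation
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```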

Thank you.

If your model fits on a single GPU, running it there will always be faster than splitting it across multiple GPUs, since sharding the model adds inter-GPU communication overhead that slows things down.
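
If you want to make sure everything stays on one card rather than being sharded, something like this works (sketch only, placeholder model id):

```python
import torch
from transformers import AutoModelForCausalLM

# The empty-string key maps every module of the model to GPU 0,
# so no layers are split across devices.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model id
    torch_dtype=torch.float16,
    device_map={"": 0},
)
```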

Thank you. Is there any other way to improve inference speed? I’m already using an H100 with 4-bit quantization and flash attention 2.
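
For reference, here is roughly how I'm loading the model at the moment (the model id is just a placeholder for my actual checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model id

# 4-bit quantization via bitsandbytes, with fp16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    torch_dtype=torch.float16,
    device_map={"": 0},  # keep everything on the single H100
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```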