Thank you. Is there any other way to improve inference speed? I’m already using an H100 with 4-bit quantization and FlashAttention-2.