Multi-GPU inference

Hello, I am trying to maximize inference speed for a single prompt on a small (7B) model. I have a server with 4 GPUs. It seems possible to use accelerate to speed up inference. Does anyone have example code? I only see examples of splitting multiple prompts across GPUs, but I only have one prompt at a time.
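
For reference, this is roughly the kind of accelerate-style loading I mean, just a sketch with `device_map="auto"` (the model id is only a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder 7B model, substitute your own

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate shard the model's layers across the available GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Single prompt, greedy generation
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```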

Thank you.

If your model fits on a single GPU, running it there will always be faster than splitting it across multiple GPUs, since sharding the model adds inter-GPU communication overhead that slows things down.
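
If you want to make sure everything stays on one card rather than being sharded, something like this works (sketch only, placeholder model id):

```python
import torch
from transformers import AutoModelForCausalLM

# The empty-string key maps every module of the model to GPU 0,
# so no layers are split across devices.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model id
    torch_dtype=torch.float16,
    device_map={"": 0},
)
```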

Thank you. Is there any other way to improve inference speed? I’m already using an H100 with 4-bit quantization and flash attention 2.
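
For reference, here is roughly how I'm loading the model at the moment (the model id is just a placeholder for my actual checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model id

# 4-bit quantization via bitsandbytes, with fp16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    torch_dtype=torch.float16,
    device_map={"": 0},  # keep everything on the single H100
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```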