Inference on multiple GPUs

I am using 8 A6000 GPUs for an image-to-text inference task. I deployed the model across multiple GPUs with device_map="auto", but as soon as inference begins I get an error saying that GPU 0 does not have enough memory. Is this inherent to how the model runs inference, i.e. is the additional memory overhead during inference handled primarily by the first GPU?

import torch
from transformers import Qwen2VLForConditionalGeneration

# Shard the model across all visible GPUs
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
...
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

It does seem that the memory overhead during inference is handled primarily by the first GPU, as you say, since I get the same error with another model.

Ideally, one could check the available memory on every GPU and automatically route input processing to a GPU with enough free space. However, I haven't found a built-in way to do this.
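That said, PyTorch can at least report per-device free memory, so a rough version of this is scriptable. A minimal sketch (the pick_freest_gpu helper here is hypothetical, not part of any library, and note that with a sharded model the inputs generally still need to live on model.device):

import torch

def pick_freest_gpu() -> torch.device:
    # torch.cuda.mem_get_info returns (free_bytes, total_bytes) for a device
    free_per_device = [
        (torch.cuda.mem_get_info(i)[0], i) for i in range(torch.cuda.device_count())
    ]
    _, best = max(free_per_device)
    return torch.device(f"cuda:{best}")

# e.g. inputs = inputs.to(pick_freest_gpu())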

Instead, I make sure to load all model weights onto the other GPUs so the first GPU stays free. Say you have 4 GPUs in total: you can use the max_memory parameter to cap how much of the model is placed on each GPU, and set the cap for GPU 0 to 0GB so no weights land on it. That leaves an entirely free GPU to handle large inputs during inference.
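For example, with 4 GPUs (a sketch; the 40GiB budgets are placeholders to tune for your cards, not measured values):

import torch
from transformers import Qwen2VLForConditionalGeneration

# Put no weights on GPU 0 so it stays free for inference overhead;
# cap each remaining GPU at an illustrative 40GiB of weights.
max_memory = {0: "0GiB", 1: "40GiB", 2: "40GiB", 3: "40GiB"}

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory=max_memory,
)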

Available memory calculation taken from: Top 4 Ways to Find Total Free and Available GPU Memory Using …


Although hiccups are not uncommon, the Accelerate library makes multi-GPU inference relatively easy to achieve. It can be used seamlessly from Transformers and Diffusers.
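For instance, on the Diffusers side (a minimal sketch, assuming a recent diffusers release that accepts device_map="balanced" for pipelines; the model ID is just an example):

import torch
from diffusers import DiffusionPipeline

# "balanced" lets Accelerate spread the pipeline's components across GPUs
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    device_map="balanced",
)
image = pipe("an astronaut riding a horse").images[0]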