Inference on multiple GPUs

I am using 8 A6000 GPUs for an image-to-text inference task. I deployed the model across multiple GPUs with device_map="auto", but as soon as inference begins I get an error saying that GPU 0 does not have enough memory. Is this inherent to how the model runs inference, i.e. is the additional memory overhead during inference handled primarily by the first GPU?

import torch
from transformers import Qwen2VLForConditionalGeneration

# Shard the model across all visible GPUs
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
...
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

It does seem that the memory overhead during inference is handled primarily by the first GPU, as you say, since I get the same error with another model.

Ideally, one could check the available memory on every GPU and automatically route input processing to a GPU with enough free space. However, I haven't found a built-in way to do this.
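That said, PyTorch can at least report per-device free memory, so a rough version of this is scriptable. A minimal sketch (the pick_freest_gpu helper here is hypothetical, not part of any library, and note that with a sharded model the inputs generally still need to live on model.device):

import torch

def pick_freest_gpu() -> torch.device:
    # torch.cuda.mem_get_info returns (free_bytes, total_bytes) for a device
    free_per_device = [
        (torch.cuda.mem_get_info(i)[0], i) for i in range(torch.cuda.device_count())
    ]
    _, best = max(free_per_device)
    return torch.device(f"cuda:{best}")

# e.g. inputs = inputs.to(pick_freest_gpu())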

Instead, I make sure to load all model weights onto the other GPUs so the first GPU stays free. Say you have 4 GPUs in total: you can use the max_memory parameter to cap how much of the model is placed on each GPU, and set the cap for GPU 0 to 0GB so no weights land on it. That leaves an entirely free GPU to handle large inputs during inference.
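For example, with 4 GPUs (a sketch; the 40GiB budgets are placeholders to tune for your cards, not measured values):

import torch
from transformers import Qwen2VLForConditionalGeneration

# Put no weights on GPU 0 so it stays free for inference overhead;
# cap each remaining GPU at an illustrative 40GiB of weights.
max_memory = {0: "0GiB", 1: "40GiB", 2: "40GiB", 3: "40GiB"}

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory=max_memory,
)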

Available memory calculation taken from: Top 4 Ways to Find Total Free and Available GPU Memory Using …


Although hiccups are not uncommon, the Accelerate library makes multi-GPU inference relatively easy to achieve. It can be used seamlessly from Transformers and Diffusers.
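For instance, on the Diffusers side (a minimal sketch, assuming a recent diffusers release that accepts device_map="balanced" for pipelines; the model ID is just an example):

import torch
from diffusers import DiffusionPipeline

# "balanced" lets Accelerate spread the pipeline's components across GPUs
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    device_map="balanced",
)
image = pipe("an astronaut riding a horse").images[0]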