I am doing inference with LLaVA on 2 A100s by iterating through my dataset and generating outputs one at a time:
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation='flash_attention_2',
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)
processor = AutoProcessor.from_pretrained(pretrained_model_name_or_path=model_id)
for item in dataset:
    inputs = processor(text=prompt, images=item["image"], return_tensors='pt').to(0, torch.float16)
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
However, this is not making the most of my 2 GPUs: it hardly fills the memory of the first one, and the second sits almost idle. How can I make better use of my GPU memory and run batch inference here? Thank you.
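To make the question concrete, this is the kind of batched loop I have been considering. It is only a rough sketch, not tested end to end: `batched`, `run_batched_inference`, and `batch_size` are my own illustrative names, and it assumes the `model`, `processor`, `dataset`, and `prompt` from the snippet above (my understanding is that passing `device_map="auto"` to `from_pretrained` is what would shard the model across both GPUs).

```python
def batched(items, batch_size):
    """Yield successive slices of `items` with at most `batch_size` elements each."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def run_batched_inference(model, processor, dataset, prompt, batch_size=8):
    """Sketch: generate for `batch_size` images at a time instead of one by one."""
    import torch  # only needed once a model is actually run

    # Decoder-only models should be padded on the LEFT for generation,
    # so every prompt ends immediately before the newly generated tokens.
    processor.tokenizer.padding_side = "left"

    outputs = []
    for batch in batched(list(dataset), batch_size):
        inputs = processor(
            text=[prompt] * len(batch),                    # one prompt per image
            images=[item["image"] for item in batch],
            padding=True,                                  # pad to the longest prompt in the batch
            return_tensors="pt",
        ).to(model.device, torch.float16)
        generated = model.generate(**inputs, max_new_tokens=200, do_sample=False)
        outputs.extend(processor.batch_decode(generated, skip_special_tokens=True))
    return outputs
```

Is this roughly the right approach, and how large can I push `batch_size` before the A100s run out of memory with the 4-bit model?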