How can I batch LLaVa inference so that I can use all of my GPU memory?

I am running inference with LLaVa on 2 A100s by iterating through my dataset and generating outputs one at a time.

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load LLaVa in 4-bit with Flash Attention 2
model = LlavaForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation='flash_attention_2',
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

processor = AutoProcessor.from_pretrained(pretrained_model_name_or_path=model_id)

for item in dataset:
    # one sample per generate() call
    inputs = processor(text=prompt, images=item["image"], return_tensors='pt').to(0, torch.float16)
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

However, this does not make the most of my 2 GPUs; it barely fills the memory of the first one. How can I use all of my GPU memory and do batch inference here? Thank you.
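
For reference, this is roughly the kind of batched loop I am imagining: passing lists of prompts and images to the processor with padding enabled and decoding the whole batch at once. The batch_size value and the left-padding setting are my own guesses, so I am not sure this is the right approach:

processor.tokenizer.padding_side = "left"  # guess: left-pad so generate() continues right after the real prompt tokens

batch_size = 8           # hypothetical value, to be tuned until GPU memory is filled
items = list(dataset)

for i in range(0, len(items), batch_size):
    batch = items[i:i + batch_size]
    prompts = [prompt] * len(batch)                 # same prompt for every image in the batch
    images = [b["image"] for b in batch]
    inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(0, torch.float16)
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    texts = processor.batch_decode(outputs, skip_special_tokens=True)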
