I am doing inference with LLaVA on 2 A100s by iterating through my dataset and generating outputs one at a time:
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation='flash_attention_2',
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)
processor = AutoProcessor.from_pretrained(pretrained_model_name_or_path=model_id)
for item in dataset:
    inputs = processor(text=prompt, images=item["image"], return_tensors='pt').to(0, torch.float16)
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
However, this is not making the most of my 2 GPUs: it hardly fills the memory of the first one, and the second sits almost idle. How can I make better use of my GPU memory and run batch inference here? Thank you.
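To make the question concrete, this is the kind of batched loop I have been considering. It is only a rough sketch, not tested end to end: `batched`, `run_batched_inference`, and `batch_size` are my own illustrative names, and it assumes the `model`, `processor`, `dataset`, and `prompt` from the snippet above (my understanding is that passing `device_map="auto"` to `from_pretrained` is what would shard the model across both GPUs).

```python
def batched(items, batch_size):
    """Yield successive slices of `items` with at most `batch_size` elements each."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def run_batched_inference(model, processor, dataset, prompt, batch_size=8):
    """Sketch: generate for `batch_size` images at a time instead of one by one."""
    import torch  # only needed once a model is actually run

    # Decoder-only models should be padded on the LEFT for generation,
    # so every prompt ends immediately before the newly generated tokens.
    processor.tokenizer.padding_side = "left"

    outputs = []
    for batch in batched(list(dataset), batch_size):
        inputs = processor(
            text=[prompt] * len(batch),                    # one prompt per image
            images=[item["image"] for item in batch],
            padding=True,                                  # pad to the longest prompt in the batch
            return_tensors="pt",
        ).to(model.device, torch.float16)
        generated = model.generate(**inputs, max_new_tokens=200, do_sample=False)
        outputs.extend(processor.batch_decode(generated, skip_special_tokens=True))
    return outputs
```

Is this roughly the right approach, and how large can I push `batch_size` before the A100s run out of memory with the 4-bit model?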