Hi,
I need to run forward passes over a large number of images with various VLMs such as PaliGemma, LLaVA Next, and Qwen2-VL.
At first, I was running a simple for loop across images, something like:
for img, prompt in data:
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt},
            ],
        },
    ]
    text_conversation = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=img, text=text_conversation, return_tensors="pt")
    inputs = inputs.to(device)
    outputs = model.generate(**inputs)
However, I noticed that this has terrible GPU utilization. My first fix was simply to use a larger batch size while keeping the processor call inside the for loop, but that didn't really help: utilization showed higher peaks, but also longer stretches at 0% in between, presumably because the GPU has to wait for the CPU-side processing of the next batch.
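For reference, the batched-but-still-synchronous variant looked roughly like this (just a sketch; batched_data stands for however the data gets chunked into batches):

for batch_imgs, batch_prompts in batched_data:
    conversations = [
        [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": prompt},
                ],
            },
        ]
        for prompt in batch_prompts
    ]
    texts = [processor.apply_chat_template(c, add_generation_prompt=True) for c in conversations]
    # the processing still happens synchronously on the CPU here, so the GPU idles in the meantime
    inputs = processor(images=batch_imgs, text=texts, padding=True, return_tensors="pt")
    inputs = inputs.to(device)
    outputs = model.generate(**inputs)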
My current solution uses a standard torch Dataset and DataLoader, which lets the CPU workers prepare the next batch while the current one is being processed on the GPU:
from typing import Dict, List

import torch
from torch.utils.data import Dataset
from torchvision.datasets.folder import pil_loader
from transformers import BatchFeature, DataCollatorForLanguageModeling


class DefaultEvaluatorDataset(Dataset):
    def __init__(self, data_dicts: List[Dict], processor_function: ProcessorFunction):
        super().__init__()
        self.data_dicts = data_dicts
        assert len(data_dicts) > 0
        assert 'image_path' in data_dicts[0]
        assert 'prompt' in data_dicts[0]
        # applies the chat template and calls the model's processor (see below)
        self.processor_function = processor_function

    def __getitem__(self, index):
        data_dict = self.data_dicts[index]
        img_path = data_dict['image_path']
        img = pil_loader(img_path)
        prompt = data_dict['prompt']
        inputs = self.processor_function(img, prompt)
        return inputs

    def __len__(self) -> int:
        return len(self.data_dicts)
class DefaultCollate:
    def __init__(self, tokenizer):
        self.llm_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

    def __call__(self, batch):
        pixel_values = []
        for indiv_inputs in batch:
            pixel_values.append(indiv_inputs.data.pop('pixel_values'))
        # this will break if pixel values have different dimensions; in that case, use batch size = 1
        pixel_values = torch.stack(pixel_values, dim=0)
        inputs = self.llm_collator(batch)
        inputs.data['pixel_values'] = pixel_values
        inputs = BatchFeature(inputs)
        return inputs
dataloader = torch.utils.data.DataLoader(
    dataset,
    shuffle=shuffle,
    collate_fn=collate_fn,
    batch_size=batch_size,
    num_workers=num_workers,
)
for inputs in dataloader:
    inputs = inputs.to(model.device, model.dtype)
    outputs = model.generate(**inputs, **generation_kwargs)
    ...
processor_function here is simply a model-dependent function that applies the chat template and then calls the model's processor.
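As a concrete (simplified) example of what such a processor_function looks like for one of these chat-template-based processors:

def make_processor_function(processor):
    # builds the chat-template text for a single example and runs the model's processor on it
    def processor_function(img, prompt):
        conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": prompt},
                ],
            },
        ]
        text = processor.apply_chat_template(conversation, add_generation_prompt=True)
        return processor(images=img, text=text, return_tensors="pt")
    return processor_function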
The Dataset/DataLoader approach works nicely for models where the pixel_values can simply be stacked. However, for models like LLaVA Next, the shape of pixel_values depends on the input resolution, so with batch size > 1 this code breaks. To handle those models, I would have to call the processor on the entire batch at once instead of processing each example individually and collating the results afterwards. That doesn't fit so nicely with the torch Dataset class, which is built around processing single examples at a time.
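To make that concrete, the only batch-level variant I can think of is to return raw images and prompts from the dataset and move the processor call into the collate function, so the processor sees the whole batch and can pad/stack things however it wants (a rough sketch; the names and the build_text helper are illustrative and would again be model dependent):

class RawEvaluatorDataset(Dataset):
    # returns raw (PIL image, prompt) pairs; no processing in __getitem__
    def __init__(self, data_dicts):
        self.data_dicts = data_dicts

    def __getitem__(self, index):
        data_dict = self.data_dicts[index]
        return pil_loader(data_dict['image_path']), data_dict['prompt']

    def __len__(self):
        return len(self.data_dicts)

class ProcessorCollate:
    # calls the model's processor on the whole batch at once,
    # so the processor decides how to pad/stack pixel_values
    def __init__(self, processor, build_text):
        self.processor = processor
        self.build_text = build_text  # applies the chat template to a single prompt

    def __call__(self, batch):
        images = [img for img, _ in batch]
        texts = [self.build_text(prompt) for _, prompt in batch]
        return self.processor(images=images, text=texts, padding=True, return_tensors="pt")

But this still requires per-model knowledge of how the processor wants to receive batches, which is exactly what I was hoping to avoid.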
Is there any known solution for this kind of asynchronous processing that is not too model dependent? I don't want to reimplement half of the processor just to get a proper collate function, especially since models like LLaVA Next have quite complex processing pipelines.
Does huggingface datasets offer a solution for this?
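For example, I was wondering whether something along the lines of with_transform with a batched processor call is the intended pattern (untested sketch; build_text again stands for the per-model chat-template helper from above):

from datasets import Dataset as HFDataset

hf_dataset = HFDataset.from_list(data_dicts)

def batch_transform(batch):
    # applied lazily on batches of examples when they are accessed
    images = [pil_loader(p) for p in batch['image_path']]
    texts = [build_text(p) for p in batch['prompt']]
    return processor(images=images, text=texts, padding=True, return_tensors="pt")

hf_dataset = hf_dataset.with_transform(batch_transform)

I'm not sure how that would interact with the DataLoader's batching and multiprocessing, though.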
Thank you!