Hi, I'd like to explore few-shot use cases with LMMs such as LLaVA on the Hub. The original authors seem to indicate that the model does not perform well with more than one image. How should I format the prompt for multi-image inputs, given that the single-image template is USER: <image>\n<prompt>\nASSISTANT: ? Should I simply use a template like USER: <image1><image2>\n<prompt>\nASSISTANT: ?
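In case it helps, here is roughly what I've been trying with the processor directly. This is only my guess: the model ID is the llava-hf checkpoint from the Hub, the file names are placeholders, and I'm not sure whether repeating the <image> placeholder (rather than numbered placeholders) is actually the expected format.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# My guess: one <image> placeholder per image, in the same order as the images list
prompt = "USER: <image>\n<image>\nWhat changed between the two pictures?\nASSISTANT:"

# Hypothetical local files, just for illustration
images = [Image.open("before.jpg"), Image.open("after.jpg")]

inputs = processor(text=prompt, images=images, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```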
Also related, how would I feed the model multiple images for inference using the Hub pipeline? Should I stack all the images before feeding them? My current single-image call is below for reference.
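This single-image pipeline call works fine for me (the file name is just a placeholder); I'm unsure how to extend it to several images, e.g. whether to pass a list of images for one prompt or to stack them into a single input first.

```python
from PIL import Image
from transformers import pipeline

pipe = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")

image = Image.open("example.jpg")  # placeholder local file
prompt = "USER: <image>\nDescribe this image.\nASSISTANT:"

# Single image works; unclear how to pass two images for one prompt
out = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 100})
print(out[0]["generated_text"])
```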