Hi, I wanted to explore playing with few-shot use cases with LMMs such as LLaVA on the Hub. The original authors seem to indicate that it does not perform well when using more than one image. How should I format the prompt when trying multi-image inputs, assuming the template for one image is USER: <…

LLaVA multi-image input support for inference

mikehemberger August 28, 2024, 6:04pm 8

Hey @alzaia ,
Or maybe consider using the Phi 3.5 vision model. Worked great for me:

The code snippets show how to add multiple images.
Best,
Mike

Topic		Replies	Views
Multimodal LLM with Image and Text sequentially in its prompt 🤗Transformers	2	12446	January 1, 2024
Turning a LLaMA model into a LLaVA Beginners	0	90	June 24, 2024
Looking information on the training set used in LLaVA Beginners	0	11	July 24, 2024
ValueError: Image features and image tokens do not match 🤗Transformers	2	2117	April 14, 2025
Error making predictions using LMM (LLaVA) model on multiple GPUs Intermediate	0	542	March 27, 2024