LLaVA multi-image input support for inference

Hi, I wanted to explore few-shot use cases with LMMs such as LLaVA on the Hub. The original authors seem to indicate that it does not perform well with more than one image. How should I format the prompt for multi-image inputs, given that the single-image template is `USER: <image>\n<prompt>\nASSISTANT:`? Should I simply use the template `USER: <image1><image2>\n<prompt>\nASSISTANT:`?

Also related: how would one feed the model multiple images for inference using the Hub pipeline? Should we stack all the images before feeding them?
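
For concreteness, this is roughly what I would naively try with the transformers API, repeating the `<image>` placeholder once per image (just a sketch; the checkpoint name is one example llava-hf model, and the image URLs are placeholders I made up):

```python
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # example checkpoint; other llava-hf models load the same way
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Placeholder URLs, substitute your own images
image1 = Image.open(requests.get("https://example.com/image1.png", stream=True).raw)
image2 = Image.open(requests.get("https://example.com/image2.png", stream=True).raw)

# One <image> token per image, in the order the images are passed to the processor
prompt = "USER: <image><image>\nWhat do these two images have in common?\nASSISTANT:"
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

Whether the model actually attends to both images is of course a separate question from whether this runs.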

Hi, not that helpful, but: you can send two images, yet the LLaVA training data did not include two-image examples, so the result is arbitrary…
When I tried it, only the newest image was looked at and the previous ones were ignored.
I think you would need to retrain or fine-tune the LLaVA model with two-image data.

See this question I asked, which does show you how to feed multiple images into the model pipeline, though the author is clear they haven't trained on any two-image datasets:
YouLiXiya/tinyllava-v1.0-1.1b-hf · The model supports multi-image and multi-prompt generation.?

Thanks! This is what I was expecting. I saw the same kind of answers from the authors on their GitHub as well. I guess we will need to wait for LLaVA 2.0 for this :sweat_smile: (LLaVA 1.6 just came out, but I do not think it was trained on multi-image).

@alazia I think you are right re LLaVA 1.6, but you could fine-tune any of these models on multi-image data and it would work. There is also the image-stitching approach (combining multiple images into one and then feeding that to the network); here is an example showing it working: Multiple image embeds in one prompt? · Issue #12 · vikhyat/moondream · GitHub. It is very prompt-dependent, and this seems to be the case with many LLaVA models. The stitching itself is only a few lines of PIL, as in the sketch below.
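
This is a rough sketch of the idea, not the exact code from the moondream issue; the function name and the `left.png`/`right.png` filenames are placeholders. You then feed the combined canvas to the model as a single `<image>`:

```python
from PIL import Image

def stitch_side_by_side(img1: Image.Image, img2: Image.Image) -> Image.Image:
    # Resize the second image to match the first one's height so the canvas is rectangular
    if img2.height != img1.height:
        new_width = int(img2.width * img1.height / img2.height)
        img2 = img2.resize((new_width, img1.height))
    canvas = Image.new("RGB", (img1.width + img2.width, img1.height), "white")
    canvas.paste(img1, (0, 0))
    canvas.paste(img2, (img1.width, 0))
    return canvas

stitched = stitch_side_by_side(Image.open("left.png"), Image.open("right.png"))
stitched.save("stitched.png")  # use this single image in the usual USER: <image>\n... prompt
```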

@sujitvasanth Yeah, but right now I do not have the bandwidth to fine-tune VLMs; I was hoping for a zero-shot setup. You are right that the stitching approach is an option, but remember that your image input will be rescaled to the vision encoder's input size (e.g. 336x336 for LLaVA 1.5), so trying to fit two images into such a small resolution will lose a lot of detail and pixel-level features: stitched side by side, each image effectively gets only about 168x336 pixels.