LLaVA multi-image input support for inference

Hi, I wanted to explore playing with few-shot use cases with LMMs such as LLaVA on the Hub. The original authors seem to indicate that it does not perform well when using more than one image. How should I format the prompt when trying multi-image inputs, assuming the template for one image is USER: <image>\n<prompt>\nASSISTANT:? Should I simply use the template USER: <image1><image2>\n<prompt>\nASSISTANT:?

Also related, how would I feed the model multiple images for inference using the Hub pipeline? Should we stack all the images before feeding them?

Hi, this may not be that helpful, but you can send 2 images. However, the LLaVA training data did not include 2-image examples, so the result is arbitrary…
When I tried it, only the newest image was looked at and the previous ones were ignored.
I think you would need to retrain or fine-tune the LLaVA model with 2-image data.

See this question I asked. It shows how to feed multiple images into the model pipeline, but the author is clear that they haven’t trained with any 2-image datasets:
YouLiXiya/tinyllava-v1.0-1.1b-hf · The model supports multi-image and multi-prompt generation.?

Thanks! This is what I was expecting. I saw the same kind of answers from the authors on their GitHub as well. I guess we will need to wait for LLaVA 2.0 for this :sweat_smile: (LLaVA 1.6 just came out, but I do not think it was trained on multi-image data).

@alazia I think you are right re LLaVA 1.6, but you could fine-tune any of these models with multi-image data and it would work. There is also the image stitching approach (combining multiple images into one and then feeding that to the network). Here is an example of this: Multiple image embeds in one prompt? · Issue #12 · vikhyat/moondream · GitHub, which shows it working… It is very prompt dependent, and this seems to be the case with many LLaVA models.

@sujitvasanth Yeah, but right now I do not have the bandwidth to do fine-tuning of VLMs, I was hoping for a zero-shot setup. You are right, the stitching approach is an option, but remember that your image input will be rescaled to the vision encoder size (e.g. 336x336 for LLaVA 1.5), which means that trying to put 2 images inside such a small resolution will make you lose a lot of details and pixel-level features.
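To put rough numbers on the detail loss from stitching, here is a minimal sketch. It assumes the images are placed side by side and the stitched canvas is scaled so its longer side equals the encoder side, with the aspect ratio preserved (for example, pad to square and then resize). That preprocessing is an assumption: the real pipeline may crop or stretch instead, which changes the numbers. The helper name `stitched_tile_size` is made up for illustration.

```python
def stitched_tile_size(n_images, src_w, src_h, encoder_side=336):
    """Approximate size of each source image after stitching and downscaling.

    Assumption: n_images of size src_w x src_h are placed side by side, and the
    canvas is scaled so its longer side equals encoder_side (aspect ratio kept).
    """
    canvas_w = n_images * src_w          # side-by-side stitching widens the canvas
    canvas_h = src_h
    scale = encoder_side / max(canvas_w, canvas_h)
    return int(src_w * scale), int(src_h * scale)


# Compare how many pixels each image keeps as more images are stitched together.
for n in (1, 2, 4):
    w, h = stitched_tile_size(n, 336, 336)
    share = (w * h) / (336 * 336)
    print(f"{n} image(s): each tile is about {w}x{h} px ({share:.1%} of the original pixels)")
```

Under these assumptions, two 336x336 images stitched side by side end up at roughly 168x168 each, i.e. a quarter of the pixels a single image would get.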

Has anyone tried Video-LLaVA or LLaVa-NeXT-Video? If so, which out of the two was more accurate?

Hi,

The original LLaVa models were indeed not trained with interleaved images, but the docs now show how to pass multiple images: LLaVa. The same goes for LLaVA-NeXT.

See also here where we discuss which ones could be useful for few-shot prompting.
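For the prompt side of the few-shot setup, a minimal sketch of how an interleaved prompt could be assembled is below. It reuses the `USER: <image>\n<prompt>\nASSISTANT:` template from the first post. The helper name `build_few_shot_prompt`, the newline between turns, and the example questions are assumptions for illustration only, so check the model docs for the exact template your checkpoint expects. The images would then be passed alongside the prompt in the same order as the placeholders, with one image per `<image>` token.

```python
def build_few_shot_prompt(examples, query, image_token="<image>"):
    """Build an interleaved few-shot prompt with one image placeholder per turn.

    examples: list of (question, answer) pairs, each paired with one image.
    query: the question for the final image, left open for the model to answer.
    The turn separator ("\n") is an assumption, not a documented requirement.
    """
    turns = [
        f"USER: {image_token}\n{question}\nASSISTANT: {answer}"
        for question, answer in examples
    ]
    # Final turn: the assistant part is left empty so the model completes it.
    turns.append(f"USER: {image_token}\n{query}\nASSISTANT:")
    return "\n".join(turns)


# Hypothetical two-shot example; three images would accompany this prompt.
prompt = build_few_shot_prompt(
    examples=[
        ("What animal is this?", "A cat."),
        ("What animal is this?", "A dog."),
    ],
    query="What animal is this?",
)
print(prompt)
```

Keeping one placeholder per turn makes it easy to verify that the number of `<image>` tokens matches the number of images before calling the model.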

Hey @alzaia ,
Or maybe consider using the Phi 3.5 vision model. Worked great for me:

The code snippets show how to add multiple images.
Best,
Mike

Indeed, and another recently released model that supports few-shot prompting is Qwen2-VL, which is integrated natively in the Transformers library. See this tweet regarding sample usage.

Note that it’s a fast-moving field, so in about 2 weeks this model will again be surpassed by another one.