LLaVA multi-image input support for inference

Hi, I wanted to explore few-shot use cases with LMMs such as LLaVA on the Hub. The original authors seem to indicate that it does not perform well with more than one image. How should I format the prompt for multi-image inputs, given that the single-image template is `USER: <image>\n<prompt>\nASSISTANT:`? Should I simply use the template `USER: <image1><image2>\n<prompt>\nASSISTANT:`?

Also related: how would one feed the model multiple images for inference using the Hub pipeline? Should we stack all the images before feeding them?
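
For concreteness, this is roughly what I would naively try with the transformers API, repeating the `<image>` placeholder once per image (just a sketch; the checkpoint name is one example llava-hf model, and the image URLs are placeholders I made up):

```python
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # example checkpoint; other llava-hf models load the same way
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Placeholder URLs, substitute your own images
image1 = Image.open(requests.get("https://example.com/image1.png", stream=True).raw)
image2 = Image.open(requests.get("https://example.com/image2.png", stream=True).raw)

# One <image> token per image, in the order the images are passed to the processor
prompt = "USER: <image><image>\nWhat do these two images have in common?\nASSISTANT:"
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

Whether the model actually attends to both images is of course a separate question from whether this runs.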

Hi, not that helpful, but: you can send two images, yet the LLaVA training data did not include two-image examples, so the result is arbitrary…
When I tried it, only the newest image was looked at and the previous ones were ignored.
I think you would need to retrain or fine-tune the LLaVA model with two-image data.

See this question I asked, which does show you how to feed multiple images into the model pipeline, though the author is clear they haven't trained on any two-image datasets:
YouLiXiya/tinyllava-v1.0-1.1b-hf · The model supports multi-image and multi-prompt generation.?

Thanks! This is what I was expecting. I saw the same kind of answers from the authors on their GitHub as well. I guess we will need to wait for LLaVA 2.0 for this :sweat_smile: (LLaVA 1.6 just came out, but I do not think it was trained on multi-image).

@alazia I think you are right re LLaVA 1.6, but you could fine-tune any of these models on multi-image data and it would work. There is also the image-stitching approach (combining multiple images into one and then feeding that to the network); here is an example showing it working: Multiple image embeds in one prompt? · Issue #12 · vikhyat/moondream · GitHub. It is very prompt-dependent, and this seems to be the case with many LLaVA models. The stitching itself is only a few lines of PIL, as in the sketch below.
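
This is a rough sketch of the idea, not the exact code from the moondream issue; the function name and the `left.png`/`right.png` filenames are placeholders. You then feed the combined canvas to the model as a single `<image>`:

```python
from PIL import Image

def stitch_side_by_side(img1: Image.Image, img2: Image.Image) -> Image.Image:
    # Resize the second image to match the first one's height so the canvas is rectangular
    if img2.height != img1.height:
        new_width = int(img2.width * img1.height / img2.height)
        img2 = img2.resize((new_width, img1.height))
    canvas = Image.new("RGB", (img1.width + img2.width, img1.height), "white")
    canvas.paste(img1, (0, 0))
    canvas.paste(img2, (img1.width, 0))
    return canvas

stitched = stitch_side_by_side(Image.open("left.png"), Image.open("right.png"))
stitched.save("stitched.png")  # use this single image in the usual USER: <image>\n... prompt
```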

@sujitvasanth Yeah, but right now I do not have the bandwidth to fine-tune VLMs; I was hoping for a zero-shot setup. You are right that the stitching approach is an option, but remember that your image input will be rescaled to the vision encoder's input size (e.g. 336x336 for LLaVA 1.5), so trying to fit two images into such a small resolution will lose a lot of detail and pixel-level features: stitched side by side, each image effectively gets only about 168x336 pixels.