Hi, I wanted to explore few-shot use cases with LMMs such as LLaVA on the Hub. The original authors seem to indicate that it does not perform well when using more than one image. How should I format the prompt when trying multi-image inputs, assuming the template for one image is USER: <image>\n<prompt>\nASSISTANT:? Should I simply use the template USER: <image1><image2>\n<prompt>\nASSISTANT:?
Also related, how would one feed the model multiple images for inference using the Hub pipeline? Should we stack all the images before feeding them? A minimal sketch of what I have in mind is below.
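For reference, here is the single-image setup I have working and the two-image variant I am unsure about (the checkpoint name, URLs, and prompt wording are just placeholders I picked, not something from the docs):

```python
import requests
from PIL import Image
from transformers import pipeline

# single-image usage that works for me
pipe = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")  # checkpoint name assumed

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this image?\nASSISTANT:"
print(pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 50}))

# the two-image variant I am unsure about: repeat the <image> token and
# pass both images somehow, e.g. as a list?
images = [image, image]  # placeholder: would be two different images
prompt = "USER: <image><image>\nWhat is different between these two images?\nASSISTANT:"
# print(pipe(images, prompt=prompt, generate_kwargs={"max_new_tokens": 50}))  # does this work?
```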
Hi, not that helpful, but: you can send 2 images, yet the LLaVA training data did not include 2-image examples, so the result is arbitrary…
When I tried it, only the most recent image was looked at and the previous ones were ignored.
I think you would need to retrain or fine-tune the LLaVA model with 2-image data.
Thanks! This is what I was expecting. I saw the same kind of answers from the authors on their github as well. I guess we will need to wait for LLaVA 2.0 for this (LLaVA 1.6 just came out but I do not think it was trained on multi-image).
@alazia I think you are right re LLaVA 1.6, but you could fine-tune any of these models with multi-image data and it would work. There is also the image stitching approach (combining multiple images into one and then feeding that to the network); here is an example of this: Multiple image embeds in one prompt? · Issue #12 · vikhyat/moondream · GitHub, which shows it working… It's very prompt dependent, and this seems to be the case with many LLaVA models. A rough sketch of the stitching idea is below.
@sujitvasanth Yeah, but right now I do not have the bandwidth to fine-tune VLMs, I was hoping for a zero-shot setup. You are right, the stitching approach is an option, but remember that your image input will be rescaled to the vision encoder size (e.g. 336x336 for LLaVA 1.5), which means that trying to fit 2 images into such a small resolution will make you lose a lot of detail and pixel-level features.
The original LLaVa models were indeed not trained with interleaved images, but the docs now show how to do that: LLaVa. The same goes for LLaVA-NeXT.
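A minimal sketch of the interleaved format, assuming a recent transformers version and the llava-hf checkpoint (the example text, URLs, and prompt wording are my own placeholders, so double-check against the model card):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint, see the Hub for others
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# one <image> token per image, interleaved with the few-shot text
prompt = (
    "USER: <image>\nThis example shows two cats on a couch.\n"
    "<image>\nWhat does this image show?\nASSISTANT:"
)
urls = [
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "https://www.ilankelman.org/stopsigns/australia.jpg",
]
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```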
See also here where we discuss which ones could be useful for few-shot prompting.
Indeed, and another model that just got released and supports few-shot prompting is Qwen2-VL, which is integrated natively in the Transformers library. See this tweet for sample usage.
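A quick sketch of how a multi-image, chat-style prompt could look with Qwen2-VL in transformers (checkpoint name, message text, and URLs are assumptions on my part; the model card has the canonical usage):

```python
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed checkpoint name
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# chat-style message with two images interleaved with text, few-shot style
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "This example shows two cats on a couch."},
            {"type": "image"},
            {"type": "text", "text": "What does this image show?"},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

urls = [
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "https://www.ilankelman.org/stopsigns/australia.jpg",
]
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

inputs = processor(text=[text], images=images, padding=True, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)

# strip the prompt tokens so only the generated answer is printed
generated = output[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```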
Note that it’s a fast-moving field, so in about 2 weeks this model will again be surpassed by another one.