LLaVA multi-image input support for inference

Indeed, and another model which just got released which supports few-shots is Qwen2-VL, which is integrated natively in the Transformers library. See this tweet regarding sample usage.

Note that it’s a fast moving field so in about 2 weeks this model will again be surpassed by another one.