Hi, I'd like to explore few-shot use cases with LMMs such as LLaVA on the Hub. The original authors seem to indicate that the model does not perform well with more than one image. How should I format the prompt for multi-image inputs, given that the single-image template is USER: <image>\n<prompt>\nASSISTANT: ? Should I simply use a template like USER: <image1><image2>\n<prompt>\nASSISTANT: ?
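In case it helps, here is roughly what I've been trying with the processor directly. This is only my guess: the model ID is the llava-hf checkpoint from the Hub, the file names are placeholders, and I'm not sure whether repeating the <image> placeholder (rather than numbered placeholders) is actually the expected format.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# My guess: one <image> placeholder per image, in the same order as the images list
prompt = "USER: <image>\n<image>\nWhat changed between the two pictures?\nASSISTANT:"

# Hypothetical local files, just for illustration
images = [Image.open("before.jpg"), Image.open("after.jpg")]

inputs = processor(text=prompt, images=images, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```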
Also related, how would I feed the model multiple images for inference using the Hub pipeline? Should I stack all the images before feeding them? My current single-image call is below for reference.
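This single-image pipeline call works fine for me (the file name is just a placeholder); I'm unsure how to extend it to several images, e.g. whether to pass a list of images for one prompt or to stack them into a single input first.

```python
from PIL import Image
from transformers import pipeline

pipe = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")

image = Image.open("example.jpg")  # placeholder local file
prompt = "USER: <image>\nDescribe this image.\nASSISTANT:"

# Single image works; unclear how to pass two images for one prompt
out = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 100})
print(out[0]["generated_text"])
```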