LLaVA multi-image input support for inference

Thanks! This is what I was expecting. I saw the same kind of answers from the authors on their GitHub as well. I guess we will need to wait for LLaVA 2.0 for this :sweat_smile: (LLaVA 1.6 just came out, but I do not think it was trained on multi-image inputs).