I am looking for multi-modal language models, like Qwen2-VL, that can accept multiple videos as input and let me reference each video individually. For instance, if the input consists of four videos, I want to be able to refer to each one separately rather than concatenating all four into a single long video. Do you know of any other open-source models capable of this? Links or resources for implementing them would also be greatly appreciated.
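For context, here is the kind of input structure I have in mind. Qwen2-VL's chat-message format accepts several video entries in a single user turn, so each clip stays a distinct item and can be referred to by position ("the first video", "the third video"). This is only a sketch of the message payload; the file paths and prompt text are placeholders, and actual inference would go through `transformers` and `qwen_vl_utils.process_vision_info`:

```python
# Sketch: one user turn containing four separate video entries plus a text
# prompt that refers to each clip by position. Paths are placeholders.
video_paths = [f"file:///data/clip_{i}.mp4" for i in range(1, 5)]

messages = [
    {
        "role": "user",
        "content": [
            # One "video" item per clip keeps the clips distinct in the
            # template, instead of concatenating them into one long video.
            *({"type": "video", "video": p} for p in video_paths),
            {
                "type": "text",
                "text": (
                    "Compare the first video with the third video, "
                    "then summarize the second and fourth separately."
                ),
            },
        ],
    }
]

# With Qwen2-VL, this structure would typically be passed to
# processor.apply_chat_template(messages, add_generation_prompt=True)
# and qwen_vl_utils.process_vision_info(messages) before generation.
content = messages[0]["content"]
video_items = [c for c in content if c["type"] == "video"]
print(len(video_items))  # → 4 individually addressable videos
```

I would like to know which other open-source models support an analogous per-video addressing scheme.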