I am looking for multi-modal language models, like Qwen2-VL, that can accept multiple videos as input and let me reference each video individually. For instance, if the input consists of four videos, I want to be able to refer to each one separately rather than concatenating all four into a single long video. Do you know of any other open-source models capable of this? Links or resources for implementing them would also be greatly appreciated.
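For context, here is the kind of input structure I have in mind. Qwen2-VL's chat-message format accepts several video entries in a single user turn, so each clip stays a distinct item and can be referred to by position ("the first video", "the third video"). This is only a sketch of the message payload; the file paths and prompt text are placeholders, and actual inference would go through `transformers` and `qwen_vl_utils.process_vision_info`:

```python
# Sketch: one user turn containing four separate video entries plus a text
# prompt that refers to each clip by position. Paths are placeholders.
video_paths = [f"file:///data/clip_{i}.mp4" for i in range(1, 5)]

messages = [
    {
        "role": "user",
        "content": [
            # One "video" item per clip keeps the clips distinct in the
            # template, instead of concatenating them into one long video.
            *({"type": "video", "video": p} for p in video_paths),
            {
                "type": "text",
                "text": (
                    "Compare the first video with the third video, "
                    "then summarize the second and fourth separately."
                ),
            },
        ],
    }
]

# With Qwen2-VL, this structure would typically be passed to
# processor.apply_chat_template(messages, add_generation_prompt=True)
# and qwen_vl_utils.process_vision_info(messages) before generation.
content = messages[0]["content"]
video_items = [c for c in content if c["type"] == "video"]
print(len(video_items))  # → 4 individually addressable videos
```

I would like to know which other open-source models support an analogous per-video addressing scheme.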