Hi all,
I’ve read papers where multimodal models (VLMs) take both text and image/video as inputs, but I’m not sure whether we can also feed in signal data (e.g., sensor readings) as a third input alongside the text and image/video. I’d like to hear your thoughts.
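For context, what I have in mind is treating the sensor stream as just another encoder branch that gets projected into the same embedding space before fusion. A toy NumPy sketch of that idea (the dimensions, the mean-pooling over time, and the random-projection "encoders" are all placeholder choices of mine, not from any specific paper):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared embedding size (assumed)

def project(x, out_dim, seed):
    """Toy 'encoder': a fixed random linear projection standing in
    for a real text/image/signal encoder."""
    r = np.random.default_rng(seed)
    W = r.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return x @ W

# Stand-ins for real encoder outputs (hypothetical dims).
text_feat = rng.standard_normal(384)        # e.g. a sentence embedding
image_feat = rng.standard_normal(512)       # e.g. a ViT [CLS] embedding
sensor_raw = rng.standard_normal((100, 6))  # 100 timesteps, 6 channels (e.g. IMU)

# Encode the sensor stream: here just mean-pool over time before projecting.
sensor_feat = sensor_raw.mean(axis=0)

# Project each modality into the shared space and fuse by concatenation;
# the fused vector would then go to the language model / task head.
fused = np.concatenate([
    project(text_feat, D, seed=1),
    project(image_feat, D, seed=2),
    project(sensor_feat, D, seed=3),
])
print(fused.shape)  # (192,)
```

So my question is really whether this kind of third branch (with a proper learned signal encoder instead of these placeholders) works in practice with VLMs.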