Hi, I am new to multimodal models. I would like to understand the differences between models tagged as Visual Question Answering and those tagged as Image-Text-to-Text. From what I understand, IT2T models are used more to caption and describe images, while VQA models are used for questions directed at a specific aspect of an image. But I can't draw a clear line between the two categories. What stops me from prompting an IT2T model like I would a VQA model, and vice versa?
Image-Text-to-Text is the broader tag: the model takes an image plus a text prompt as input and generates free-form text, so it can caption, describe, or answer questions about the image. Visual Question Answering is the narrower task of answering a question about an image, and classic VQA checkpoints typically treat it as classification over a fixed answer vocabulary rather than open-ended generation. In practice, nothing stops you from asking an IT2T model a VQA-style question (modern instruction-tuned IT2T models handle this well), but a classic VQA model can only return answers from its fixed label set, so it cannot produce captions or longer descriptions.
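To make the structural difference concrete, here is a toy sketch (no real checkpoints are loaded; the function bodies and the tiny answer vocabulary are hypothetical stand-ins). A classic VQA head scores a fixed set of labels, while an IT2T model decodes an open-ended string:

```python
# Toy sketch of the two model families -- no real weights, purely illustrative.
# Classic VQA checkpoints (e.g. ViLT fine-tuned on VQAv2) classify over a fixed
# answer vocabulary; image-text-to-text models generate free-form text.

VQA_ANSWER_VOCAB = ["yes", "no", "red", "blue", "2", "3"]  # VQAv2 uses ~3k labels

def vqa_model(image, question):
    # Hypothetical classifier head: it can only ever return one of the fixed labels,
    # no matter what the question asks for.
    scores = {label: 0.0 for label in VQA_ANSWER_VOCAB}
    scores["red"] = 1.0  # pretend the model is confident in this label
    return max(scores, key=scores.get)

def it2t_model(image, prompt):
    # Hypothetical autoregressive decoder: it returns an open-ended string, so the
    # same model can caption, describe, or answer questions -- VQA is a subset.
    return "The car in the foreground is red, parked next to a blue bicycle."

print(vqa_model("photo.jpg", "What color is the car?"))   # -> "red"
print(it2t_model("photo.jpg", "What color is the car?"))  # free-form sentence
```

This is why prompting an IT2T model with a VQA-style question works fine, while the reverse (asking a fixed-vocabulary VQA model for a description) does not: the VQA model's output space simply doesn't contain sentences.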