Difference between VQA and Image-Text-to-Text?

Hi, I am new to multimodal models. I would like to understand the difference between models tagged as Visual Question Answering and those tagged as Image-Text-to-Text. From what I understand, IT2T models are used more to caption and describe images, while VQA models are used for questions directed at a specific aspect of an image. But I can't draw a clear line between the two categories. What stops me from prompting an IT2T model like I would a VQA model, and vice versa?


Image-Text-to-Text is the broader task: the model takes an image plus a text prompt as input and generates free-form text, which covers captioning, document understanding, visual question answering, and general chat about an image. Visual Question Answering is the narrower tag for models trained or fine-tuned specifically to answer a question about an image, often with short, direct answers. In practice, nothing stops you from prompting a modern IT2T model with a question, and many models under the VQA tag are essentially IT2T models evaluated on VQA benchmarks, so the distinction is mostly about training objective and intended use rather than architecture.
