Hi, can anyone recommend an image-to-text model that can take an additional text input as prior context for generating the caption?
I think you can look at the MatCha or DePlot models. You can pass in text along with the image, but I doubt it will have any significant effect on the output, even though they were trained with a text input as well.
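If you want to try it, here is a rough sketch of passing text along with the image through the Hugging Face transformers Pix2Struct classes, which is how the DePlot and MatCha checkpoints are usually loaded. The helper name caption_with_context, the "google/deplot" default, and the max_new_tokens value are my own choices for illustration, not something from this thread, and it assumes transformers, torch and Pillow are installed. As far as I know the processor renders the text as a header on top of the image for these checkpoints, so treat the text as a prompt rather than a strong conditioning signal.

```python
# Sketch only: helper name, default checkpoint and token limit are assumptions.
DEFAULT_CHECKPOINT = "google/deplot"


def caption_with_context(image_path, context,
                         checkpoint=DEFAULT_CHECKPOINT, max_new_tokens=512):
    """Generate text for an image, passing `context` as the text input."""
    # Heavy imports are kept inside the function so the module can be
    # imported without loading transformers or downloading weights.
    from PIL import Image
    from transformers import (
        Pix2StructForConditionalGeneration,
        Pix2StructProcessor,
    )

    processor = Pix2StructProcessor.from_pretrained(checkpoint)
    model = Pix2StructForConditionalGeneration.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")

    # The processor takes the image and the text together.
    inputs = processor(images=image, text=context, return_tensors="pt")

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

You would call it with something like caption_with_context("chart.png", "Generate underlying data table of the figure below:"), where chart.png is a placeholder file name. Running it twice with different context strings on the same image is a quick way to check whether the text changes the output at all.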