Image to text model that can take an additional text input

Hi, can anyone recommend an image-to-text model that can take an additional text input, so I can add context before it generates the caption?

I think you can look at the MatCha or DePlot models. You can pass in text along with the image, and they were trained with a text input as well, but I doubt it will have any significant effect on the output.
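In case it helps, here is a rough sketch of how passing text alongside the image might look with the Hugging Face `transformers` Pix2Struct classes, which as far as I know are what the MatCha and DePlot checkpoints load with. The function name `caption_with_context`, the default checkpoint `google/deplot`, and the file name in the usage comment are my own choices for illustration, not something from this thread, so treat it as a starting point rather than a tested recipe.

```python
def caption_with_context(image, context, checkpoint="google/deplot", max_new_tokens=512):
    """Generate text for a PIL image, conditioned on an extra text prompt.

    `checkpoint` is an assumption: any Pix2Struct-style checkpoint that
    accepts a text input (e.g. a MatCha or DePlot one) should fit here.
    """
    # Imported lazily so the heavy dependencies load only when the function runs.
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained(checkpoint)
    model = Pix2StructForConditionalGeneration.from_pretrained(checkpoint)

    # The processor takes the image and the text together.
    inputs = processor(images=image, text=context, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)


# Example usage (hypothetical file name):
# from PIL import Image
# image = Image.open("chart.png")
# print(caption_with_context(image, "Generate underlying data table of the figure below:"))
```

Running it with two different `context` strings on the same image is a quick way to check how much the text actually changes the output.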