Image-to-text model that can take an additional text input

Hi, can anyone recommend an image-to-text model that accepts an additional text input, so that I can provide context before the caption is generated?