Image-to-text model that can take additional text as input for context

Hi, can anyone recommend an image-to-text model that accepts an additional text input as context before generating the caption?

Anyone please? :slight_smile: