Hi! I would like to use the CLIP model for image-to-text search (captioning).
Given an image, I would like to retrieve the most similar text in the latent space, I suppose, which would have to be a text that already exists in some candidate set (e.g. from the training data), right? I wouldn't be able to retrieve a generated text from the latent space for that specific image?
How would I go about it? I guess I would need to write a decoder for that?
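For the retrieval part, here is a minimal sketch of what I have in mind, assuming the Hugging Face `transformers` CLIP API and a small hand-made list of candidate captions (the model name, image path, and captions are just placeholders, not a specific recommendation):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint (placeholder model name)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A fixed pool of candidate captions to search over (placeholders)
candidate_texts = [
    "a photo of a dog",
    "a photo of a cat",
    "a photo of a city skyline at night",
]

image = Image.open("example.jpg")  # placeholder image path

# Encode the image and all candidate texts in one forward pass
inputs = processor(text=candidate_texts, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores;
# the highest-scoring candidate is the retrieved "caption"
probs = outputs.logits_per_image.softmax(dim=-1)
best = probs.argmax(dim=-1).item()
print(candidate_texts[best], probs[0, best].item())
```

So this would only ever return one of the captions I put in `candidate_texts`, never new text, which is why I'm asking about the decoder.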
Best wishes