CLIP image-to-text search

Hi! I would like to use the CLIP model for image-to-text search (i.e., finding a caption for a given image).

Given an image, I suppose I would retrieve the most similar text in the shared latent space. That text would have to come from an existing corpus (e.g., texts that were already in the training data), right? I wouldn't be able to retrieve a newly generated text for the specific image from the latent space alone?

How would I go about it? I guess I would need to write a decoder for that?
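
For the retrieval side, this is roughly what I have in mind: a minimal sketch using the Hugging Face `transformers` CLIP API, where the model checkpoint, the image path, and the candidate captions are just placeholders for my own data.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder candidate captions -- in practice this would be my own text corpus
candidate_texts = [
    "a photo of a dog playing in the grass",
    "a photo of a cat sleeping on a sofa",
    "a photo of a mountain landscape at sunset",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("query.jpg")  # placeholder query image

# Embed the image and all candidate texts into CLIP's shared latent space
inputs = processor(text=candidate_texts, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Normalize embeddings and compute cosine similarity of the image to each text
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
similarities = (image_emb @ text_emb.T).squeeze(0)

# Rank candidate texts by similarity to the query image
for idx in similarities.argsort(descending=True).tolist():
    print(f"{similarities[idx]:.3f}  {candidate_texts[idx]}")
```

This only ranks texts that I already have, which is why I'm asking whether generating a new caption would require an actual decoder on top of the CLIP embeddings.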

Best wishes