Use VisionTextDualEncoder for image-text retrieval

Is there a tutorial on using VisionTextDualEncoder together with a contrastive image-text loss?

My use case: I would like to train a network that embeds images and their tags into the same latent space. Once the model is trained, I would like to use it for image/tag retrieval in both directions: take an image as a query and get the tags that similar images have (k-nearest neighbors), and the other way around.
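For context, here is roughly how I understand a dual encoder would be assembled from two pretrained backbones (the checkpoint names below are just examples, not a recommendation):

```python
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

# Glue a pretrained vision encoder and a pretrained text encoder together;
# randomly initialized projection heads map both into a shared latent space.
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "google/vit-base-patch16-224", "bert-base-uncased"
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)
```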

Now I am not sure exactly how to train such a model.
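From the docs, it looks like the forward pass can compute a CLIP-style contrastive loss over the in-batch pairs when `return_loss=True` is passed, so I imagine a training step would look something like this (a rough sketch continuing from the snippet above; `images` and `tags` stand in for one batch of my data):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# images: list of PIL images; tags: list of tag strings for the same items
inputs = processor(text=tags, images=images, return_tensors="pt", padding=True)

outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    pixel_values=inputs.pixel_values,
    return_loss=True,  # contrastive (CLIP-style) loss over in-batch image-text pairs
)

outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Is this the intended way to use it, or is there a recommended training script?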

I could probably do this with CLIP, although I don't need full sentence-level semantics, just image-to-tag/keyword matching.
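For the retrieval step after training, I assume I would embed both modalities through the shared projections and run k-nearest neighbors on the normalized embeddings, roughly like this (again continuing from the snippets above):

```python
import torch

with torch.no_grad():
    image_embeds = model.get_image_features(pixel_values=inputs.pixel_values)
    text_embeds = model.get_text_features(
        input_ids=inputs.input_ids, attention_mask=inputs.attention_mask
    )

# Cosine similarity: normalize, then a matrix product gives all pairwise scores
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = image_embeds @ text_embeds.T

# Top-k tags for each image query (use scores.T for tag-to-image retrieval)
topk_tag_indices = scores.topk(k=5, dim=-1).indices
```

Does that sound right, or is there a tutorial that covers this end to end?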