Predicting images based on a sentence, the unsupervised way

I have a use case where I want to predict an image based on some text.
For each image, I have the image itself plus a set of tags.

An example use case: you start with images of emojis plus tags for those emojis.
The model should then suggest one or more emojis based on a sentence that the user provides.

I do not have any training data. The idea is to embed the images + tags with a model and then query it with sentences at test time.
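To make the intended pipeline concrete, here is a minimal sketch of embed-then-retrieve: each emoji is indexed by an embedding, and a user sentence is embedded into the same space and matched by similarity. Everything here is illustrative — the bag-of-words `embed` function is only a toy stand-in for a real joint text–image encoder, and the `suggest` helper and sample tags are made up for the example.

```python
import numpy as np

# Toy stand-in for a real multimodal encoder: embed text as a
# normalized bag-of-words vector over a shared vocabulary.
# A real system would use a joint text-image model so that the
# image pixels (not just the tags) contribute to the embedding.
def embed(text, vocab):
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Hypothetical emoji "index": each image is represented by its tags only here.
emoji_tags = {
    "😀": "happy smile face grin",
    "🎂": "birthday cake party celebration",
    "⚽": "soccer football ball sport",
}

vocab = {w: i for i, w in enumerate(
    sorted({w for tags in emoji_tags.values() for w in tags.split()}))}
index = {emoji: embed(tags, vocab) for emoji, tags in emoji_tags.items()}

def suggest(sentence, top_k=1):
    """Return the top_k emojis whose embeddings are most similar to the sentence."""
    query = embed(sentence, vocab)
    ranked = sorted(index, key=lambda e: float(query @ index[e]), reverse=True)
    return ranked[:top_k]

print(suggest("it is my birthday and we have cake"))  # → ['🎂']
```

The point of the sketch is only the shape of the solution: no training, just an index of embeddings and nearest-neighbour lookup at query time.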

What would be a good starting point for embedding the images + tags in a model?
I was thinking about multimodal modeling with Perceiver IO (see Perceiver_for_Multimodal_Autoencoding.ipynb in NielsRogge/Transformers-Tutorials on GitHub).

Any other suggestions?

Thanks. Vincent