I have a use case where I want to predict an image based on some text.
For each image, I have the image itself + tags.
An example use case: you start with images of emojis + tags for those emojis.
The result would be that the model suggests one or more emojis based on a sentence that the user provides.
I do not have any training data. The idea is that I embed the images + tags into a model and test with sentences.
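To make the setup I have in mind a bit more concrete, here is a rough sketch of the kind of zero-shot matching I mean. I'm using CLIP from the transformers library purely as a placeholder embedder; the model name, image files, tags and sentence below are made up for illustration:

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Placeholder model; any multimodal embedder with an image and a text encoder would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

emoji_images = [Image.open("smile.png"), Image.open("fire.png")]  # hypothetical emoji images
emoji_tags = ["smiling face, happy, joy", "fire, lit, hot"]       # hypothetical tag strings

with torch.no_grad():
    # Embed the images and their tags in the same space.
    image_emb = model.get_image_features(**processor(images=emoji_images, return_tensors="pt"))
    tag_emb = model.get_text_features(**processor(text=emoji_tags, return_tensors="pt", padding=True))

    # One embedding per emoji: the average of its (normalised) image and tag embeddings.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    tag_emb = tag_emb / tag_emb.norm(dim=-1, keepdim=True)
    emoji_emb = (image_emb + tag_emb) / 2
    emoji_emb = emoji_emb / emoji_emb.norm(dim=-1, keepdim=True)

    # Embed the user's sentence and rank the emojis by cosine similarity.
    sentence = "That concert last night was amazing"
    sent_emb = model.get_text_features(**processor(text=[sentence], return_tensors="pt", padding=True))
    sent_emb = sent_emb / sent_emb.norm(dim=-1, keepdim=True)

    scores = sent_emb @ emoji_emb.T   # shape: (1, number of emojis)
    top = scores.squeeze(0).topk(k=2)
    print(top.indices, top.values)    # indices of the suggested emojis + their scores
```

Averaging the image and tag embeddings is just the simplest way I could think of to combine the two, so that part is very much open as well.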
What would be a good starting point for embedding the images + tags in a model?
I was thinking about multimodal modeling with Perceiver IO, for example the Perceiver_for_Multimodal_Autoencoding.ipynb notebook in NielsRogge/Transformers-Tutorials on GitHub. Would that be a good fit here?
Any other suggestions?
Thanks,
Vincent