Which model architecture is appropriate for combining image and text (e.g. tweet) inputs for social media geolocation?

I am working on social media geolocation, i.e. predicting where a post was made, using posts such as Instagram images and their accompanying text. Is there a suitable model that takes both an image and text as input and outputs the geographic location of the sample? The image encoder could be ViT or ResNet, and the text encoder could be BERT or similar.
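To make the question concrete, here is a rough sketch of the kind of architecture I have in mind, written in PyTorch. Everything in it is an assumption for illustration: the class name `LateFusionGeolocator`, the simple concatenation (late fusion) of the two embeddings, the 768-dimensional embedding sizes, and the choice to treat location as classification over 1000 hypothetical geographic cells rather than regressing latitude and longitude. Random tensors stand in for the pooled outputs of the image and text encoders, so no pretrained weights are needed to run it.

```python
import torch
import torch.nn as nn


class LateFusionGeolocator(nn.Module):
    """Hypothetical sketch: fuse image and text embeddings, classify into geo cells."""

    def __init__(self, image_dim=768, text_dim=768, hidden_dim=512, num_cells=1000):
        super().__init__()
        # Project each modality to a common hidden size before fusing.
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Classification head over the concatenated (fused) representation.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_cells),
        )

    def forward(self, image_emb, text_emb):
        image_feat = torch.relu(self.image_proj(image_emb))
        text_feat = torch.relu(self.text_proj(text_emb))
        fused = torch.cat([image_feat, text_feat], dim=-1)
        return self.classifier(fused)  # unnormalized logits, one per cell


# Random tensors stand in for pooled image-encoder and text-encoder outputs.
model = LateFusionGeolocator()
image_emb = torch.randn(4, 768)  # batch of 4 image embeddings
text_emb = torch.randn(4, 768)   # batch of 4 text embeddings
logits = model(image_emb, text_emb)
print(logits.shape)  # torch.Size([4, 1000])
```

In a real setup the two inputs would come from the actual encoders (for example, a ResNet-50 pooled feature is 2048-dimensional, so `image_dim` would change accordingly), and the logits would be trained with cross-entropy against the cell containing the true location. I am not sure whether this kind of simple concatenation is adequate or whether something like cross-attention between the modalities is needed, which is part of what I am asking.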