Hey there, this is very interesting. I have some experience with NLP and computer vision, and I've always wanted to get more experience with multi-modal models (text + vision). Also, ever since I first saw the WIT dataset, I've wanted to use it for a project, so this seems like a good opportunity.
If you want to know a little more about my background, check out my GitHub.