Any examples for VisualBertForMultipleChoice?

Hi,

Does anyone know of any examples showing how to use VisualBertForMultipleChoice, or anything similar? I am mostly looking for an example that shows how to tokenize my text data, extract visual features from my images, and pass my multi-class labels to the model.

I would like to build something similar to this paper, using radiology images and text reports to train a model that predicts 14 classes (thoracic diagnoses):

Here is the Hugging Face Transformers model I plan to use:

If anyone has a good example of how to do this with Hugging Face, please share. Thanks!

Actually, would anyone be able to share an example of any vision-language transformer model? Preferably one trained on a multi-class classification problem, but any task would be helpful. Even better if it uses the SageMaker API. Thank you!