Hi,
Does anyone know of examples showing how to use VisualBertForMultipleChoice, or anything similar? I am mostly looking for an example that shows how to tokenize my text data, how to extract visual features from my images, and how to feed my multi-class labels to the model.
I would like to build something similar to this paper, using radiology images and their text reports to train a model that predicts 14 classes (thoracic diagnoses):
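If a single study can carry several findings at once (as with ChestX-ray14-style labels), a 14-way multi-label target, with one sigmoid per finding trained via `BCEWithLogitsLoss`, may fit better than a multiple-choice head. The sketch below assumes that framing: `VisualBertModel` plus a hypothetical `VisualBertMultiLabel` linear head, a tiny randomly initialised config so it runs offline, and random token ids standing in for tokenizer output. The label names are borrowed from ChestX-ray14 purely for illustration; replace them with the classes used in your paper.

```python
import torch
import torch.nn as nn
from transformers import VisualBertConfig, VisualBertModel

# Illustrative label set (ChestX-ray14 findings); swap in your own 14 classes.
LABELS = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]


def to_multi_hot(findings: list[str]) -> torch.Tensor:
    """Encode one study's findings as a 14-dim 0/1 float vector."""
    target = torch.zeros(len(LABELS))
    for name in findings:
        target[LABELS.index(name)] = 1.0
    return target


class VisualBertMultiLabel(nn.Module):
    """Hypothetical model: VisualBERT encoder + one logit per finding."""

    def __init__(self, config: VisualBertConfig, num_labels: int):
        super().__init__()
        self.visual_bert = VisualBertModel(config)
        self.classifier = nn.Linear(config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, visual_embeds, visual_attention_mask):
        outputs = self.visual_bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            visual_embeds=visual_embeds,
            visual_attention_mask=visual_attention_mask,
        )
        return self.classifier(outputs.pooler_output)  # (batch, num_labels)


# Tiny random config so this runs offline; in practice you would start from a
# pretrained checkpoint via VisualBertModel.from_pretrained(...).
config = VisualBertConfig(
    hidden_size=64, num_hidden_layers=2, num_attention_heads=2,
    intermediate_size=128, visual_embedding_dim=2048,
)
model = VisualBertMultiLabel(config, num_labels=len(LABELS))

batch, seq_len, num_regions = 2, 16, 49
# Stand-in for a BERT tokenizer call such as
# tokenizer(reports, padding=True, truncation=True, return_tensors="pt").
input_ids = torch.randint(0, config.vocab_size, (batch, seq_len))
attention_mask = torch.ones(batch, seq_len, dtype=torch.long)
visual_embeds = torch.rand(batch, num_regions, config.visual_embedding_dim)
visual_attention_mask = torch.ones(batch, num_regions, dtype=torch.long)
labels = torch.stack([to_multi_hot(["Effusion", "Edema"]), to_multi_hot([])])

logits = model(input_ids, attention_mask, visual_embeds, visual_attention_mask)
loss = nn.BCEWithLogitsLoss()(logits, labels)  # independent sigmoid per finding
loss.backward()
print(logits.shape)  # torch.Size([2, 14])
```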
Here is the Hugging Face Transformers model I plan to use:
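For comparison, `VisualBertForMultipleChoice` in Transformers scores several candidate texts against the same image: text inputs have shape `(batch, num_choices, seq_len)`, `visual_embeds` has shape `(batch, num_choices, num_regions, visual_embedding_dim)`, and `labels` holds the index of the correct candidate, so the loss is a cross-entropy over choices with exactly one correct answer per example. Phrasing each diagnosis as a candidate text would therefore capture only a single diagnosis per study. A minimal sketch, assuming a tiny random config and random inputs purely to show the expected shapes:

```python
import torch
from transformers import VisualBertConfig, VisualBertForMultipleChoice

# Tiny random config so the sketch runs offline (no pretrained weights).
config = VisualBertConfig(
    hidden_size=64, num_hidden_layers=2, num_attention_heads=2,
    intermediate_size=128, visual_embedding_dim=2048,
)
model = VisualBertForMultipleChoice(config)

batch, num_choices, seq_len, num_regions = 2, 3, 12, 49
# Each example pairs the same report/prompt with num_choices candidate texts;
# random ids stand in for tokenizer output here.
input_ids = torch.randint(0, config.vocab_size, (batch, num_choices, seq_len))
attention_mask = torch.ones_like(input_ids)

# One set of image features per example, copied for every candidate.
image_feats = torch.rand(batch, num_regions, config.visual_embedding_dim)
visual_embeds = image_feats.unsqueeze(1).repeat(1, num_choices, 1, 1)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)

labels = torch.tensor([0, 2])  # index of the correct candidate per example

outputs = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    visual_embeds=visual_embeds,
    visual_attention_mask=visual_attention_mask,
    labels=labels,
)
print(outputs.logits.shape)  # torch.Size([2, 3])
```

The sketch uses `repeat` rather than `expand` to copy the image features across choices: the model flattens the batch and choice dimensions with `view`, which fails on an expanded tensor once the batch size is greater than 1.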
If anyone has a good example of how to do this with Hugging Face, please share. Thanks!