MMBT Model (Resnet and BERT) for multimodal embeddings

Hi! I’m trying to use the library’s implementation of Multimodal Bitransformers (Kiela et al.) to classify images and text simultaneously.
I’ve found it hard because there is very little documentation and there are no examples.
In particular, I’ve been having a hard time figuring out how to pass the encoded image together with the tokenized text to the already initialized model.
If anyone has already worked with this implementation, I could really use some help.
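For what it’s worth, here is a minimal sketch of the input format I believe the model expects, based on the MM-IMDB research example in transformers. The argument names (`input_modal`, `modal_start_tokens`, `modal_end_tokens`) and the [CLS]/[SEP] ids are assumptions from that example; double-check them against your transformers version. Only dummy tensors are built here, so no weights are downloaded:

```python
import torch

batch_size, num_image_embeds, hidden_img = 2, 3, 2048
max_text_len = 32

# 1) Image side: ResNet features pooled into a few "image token" embeddings,
#    shape (batch, num_image_embeds, 2048). Faked with random data here.
input_modal = torch.randn(batch_size, num_image_embeds, hidden_img)

# 2) Text side: token ids WITHOUT the [CLS]/[SEP] specials, because MMBT
#    re-uses those ids to delimit the image segment.
input_ids = torch.randint(1000, 2000, (batch_size, max_text_len))

# 3) The image segment gets wrapped with the tokenizer's special tokens:
#    modal_start_tokens = [CLS] id, modal_end_tokens = [SEP] id.
cls_id, sep_id = 101, 102  # bert-base-uncased ids, for illustration only
modal_start_tokens = torch.full((batch_size,), cls_id, dtype=torch.long)
modal_end_tokens = torch.full((batch_size,), sep_id, dtype=torch.long)

# The forward call would then look roughly like this (not run here,
# since it needs a loaded model):
# outputs = model(
#     input_modal=input_modal,
#     input_ids=input_ids,
#     modal_start_tokens=modal_start_tokens,
#     modal_end_tokens=modal_end_tokens,
#     return_dict=True,
# )
```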

Hey, check out this example.


Any luck with this one? What were the results?

I tried to reproduce the code from the example and ran into a couple of issues:

  • the inputs passed to the MMBTForClassification model should also include a return_dict key, otherwise the call will not work
  • depending on how you one-hot encode the labels, you may need to cast the vectors to float for the loss computation

I also have a couple of questions:

  • the data loader in the example returns batches containing image_start_token and image_end_token, which are defined as the first and last special tokens of the tokenized sentence. I’m somewhat puzzled by this and will try to investigate further why…
  • do you know why a gradient accumulation phase is used? And is it normal that it is followed by a model.zero_grad()?
  • the criterion chosen for the example is nn.BCEWithLogitsLoss(pos_weight=label_weights) and I was wondering why exactly. Is that appropriate for multiclass classification? What would be the equivalent of Keras categorical crossentropy or sparse categorical crossentropy?
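On the zero_grad question: as far as I understand, this is the standard gradient accumulation pattern rather than something specific to MMBT. `.backward()` adds into `.grad`, so gradients are summed over several mini-batches, the optimizer steps once, and the accumulated gradients are then cleared. A generic sketch (the model, data, and step count are made up):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
accumulation_steps = 4

data = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(8)]

for step, (x, y) in enumerate(data):
    # scale the loss so the accumulated sum averages over the mini-batches
    loss = criterion(model(x), y) / accumulation_steps
    loss.backward()  # gradients ADD into .grad, they are not overwritten
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        model.zero_grad()  # clear the accumulated grads for the next cycle
```

So the `model.zero_grad()` right after the step is expected: without it, the next accumulation cycle would keep adding onto stale gradients.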
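On the loss question, my understanding (not from the example script): `nn.BCEWithLogitsLoss` scores every class independently, which suits multi-label setups where several labels can be on at once. For single-label multiclass classification, the Keras sparse_categorical_crossentropy analogue is `nn.CrossEntropyLoss` with integer class indices as targets, and categorical_crossentropy corresponds to the same loss with one-hot targets (argmax them back to indices, or pass probability targets directly in torch >= 1.10). The logits and targets below are made up:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])
class_indices = torch.tensor([0, 1])           # "sparse" integer targets
one_hot = F.one_hot(class_indices, 3).float()  # "categorical" targets

# Keras sparse_categorical_crossentropy equivalent:
ce_sparse = F.cross_entropy(logits, class_indices)
# Keras categorical_crossentropy equivalent (one-hot -> indices):
ce_onehot = F.cross_entropy(logits, one_hot.argmax(dim=1))
# BCE-with-logits treats the 3 classes as independent binary problems:
bce = F.binary_cross_entropy_with_logits(logits, one_hot)
```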