MMBT Model (Resnet and BERT) for multimodal embeddings

Hi! I’m trying to use the library’s implementation of Multimodal Bitransformers (Kiela et al.) to classify images and text simultaneously.
I’ve found it hard because there is very little documentation and there are no examples.
In particular, I’ve been having a hard time figuring out how to pass the encoded image together with the tokenized text to the already initialized model.
If anyone has already worked with this implementation, I could really use some help.
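For what it’s worth, here is a minimal sketch of the input format I believe the model expects, based on the MM-IMDB research example in transformers. The argument names (`input_modal`, `modal_start_tokens`, `modal_end_tokens`) and the [CLS]/[SEP] ids are assumptions from that example; double-check them against your transformers version. Only dummy tensors are built here, so no weights are downloaded:

```python
import torch

batch_size, num_image_embeds, hidden_img = 2, 3, 2048
max_text_len = 32

# 1) Image side: ResNet features pooled into a few "image token" embeddings,
#    shape (batch, num_image_embeds, 2048). Faked with random data here.
input_modal = torch.randn(batch_size, num_image_embeds, hidden_img)

# 2) Text side: token ids WITHOUT the [CLS]/[SEP] specials, because MMBT
#    re-uses those ids to delimit the image segment.
input_ids = torch.randint(1000, 2000, (batch_size, max_text_len))

# 3) The image segment gets wrapped with the tokenizer's special tokens:
#    modal_start_tokens = [CLS] id, modal_end_tokens = [SEP] id.
cls_id, sep_id = 101, 102  # bert-base-uncased ids, for illustration only
modal_start_tokens = torch.full((batch_size,), cls_id, dtype=torch.long)
modal_end_tokens = torch.full((batch_size,), sep_id, dtype=torch.long)

# The forward call would then look roughly like this (not run here,
# since it needs a loaded model):
# outputs = model(
#     input_modal=input_modal,
#     input_ids=input_ids,
#     modal_start_tokens=modal_start_tokens,
#     modal_end_tokens=modal_end_tokens,
#     return_dict=True,
# )
```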

Hey, check out this example.


Any luck with this one? What were the results?

I tried to reproduce the code from the example and ran into a couple of issues:

  • the inputs passed to the MMBTForClassification model should also include a return_dict key, otherwise the call will not work
  • depending on how you one-hot encode the labels, you may need to cast the vectors to float for the loss computation

I also have a couple of questions:

  • the data loader in the example returns batches containing image_start_token and image_end_token, which are defined as the first and last special tokens of the tokenized sentence. I’m somewhat puzzled by this and will try to investigate further why…
  • do you know why a gradient accumulation phase is used? And is it normal that it is followed by a model.zero_grad()?
  • the criterion chosen for the example is nn.BCEWithLogitsLoss(pos_weight=label_weights) and I was wondering why exactly. Is that appropriate for multiclass classification? What would be the equivalent of Keras categorical crossentropy or sparse categorical crossentropy?
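On the zero_grad question: as far as I understand, this is the standard gradient accumulation pattern rather than something specific to MMBT. `.backward()` adds into `.grad`, so gradients are summed over several mini-batches, the optimizer steps once, and the accumulated gradients are then cleared. A generic sketch (the model, data, and step count are made up):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
accumulation_steps = 4

data = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(8)]

for step, (x, y) in enumerate(data):
    # scale the loss so the accumulated sum averages over the mini-batches
    loss = criterion(model(x), y) / accumulation_steps
    loss.backward()  # gradients ADD into .grad, they are not overwritten
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        model.zero_grad()  # clear the accumulated grads for the next cycle
```

So the `model.zero_grad()` right after the step is expected: without it, the next accumulation cycle would keep adding onto stale gradients.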
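On the loss question, my understanding (not from the example script): `nn.BCEWithLogitsLoss` scores every class independently, which suits multi-label setups where several labels can be on at once. For single-label multiclass classification, the Keras sparse_categorical_crossentropy analogue is `nn.CrossEntropyLoss` with integer class indices as targets, and categorical_crossentropy corresponds to the same loss with one-hot targets (argmax them back to indices, or pass probability targets directly in torch >= 1.10). The logits and targets below are made up:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])
class_indices = torch.tensor([0, 1])           # "sparse" integer targets
one_hot = F.one_hot(class_indices, 3).float()  # "categorical" targets

# Keras sparse_categorical_crossentropy equivalent:
ce_sparse = F.cross_entropy(logits, class_indices)
# Keras categorical_crossentropy equivalent (one-hot -> indices):
ce_onehot = F.cross_entropy(logits, one_hot.argmax(dim=1))
# BCE-with-logits treats the 3 classes as independent binary problems:
bce = F.binary_cross_entropy_with_logits(logits, one_hot)
```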