LXMERT pre-trained model

Hello, congrats to all contributors for the awesome work on LXMERT! It is exciting to see multimodal transformers coming to huggingface/transformers. Of course, I immediately tried it out and played with the demo.

Question:
Does the line lxmert_base = LxmertForPreTraining.from_pretrained("unc-nlp/lxmert-base-uncased") load an already pre-trained LXMERT model on the tasks enumerated in the original paper “(1) masked crossmodality language modeling, (2) masked object prediction via RoI-feature regression, (3) masked object prediction via detected-label classification, (4) cross-modality matching, and (5) image question answering.” (Tan & Bansal, 2019)?

Tagging our LXMERT specialist @lysandre

This question has been answered on GitHub here by @eltoto1219, the author of the huggingface implementation of LXMERT.


Hello @lysandre, thanks for tagging the right person. Here is my GitHub response, along with a new question:

Hello @eltoto1219, thank you for the answer! I suppose it was a weird question on my part; I was asking to make sure that I am loading a pre-trained LXMERT model and not some random weights. Especially because I look at output_lxmert['cross_relationship_score'] for COCO images and captions (so not for some out-of-distribution images and captions) after loading LXMERT with the aforementioned code lxmert_base = LxmertForPreTraining.from_pretrained("unc-nlp/lxmert-base-uncased"). It seems that on cross-modality matching LXMERT performs at 50% accuracy (random guessing), so I wanted to make sure that I am loading weights pre-trained on (4) cross-modality matching in the first place.

New question: Do you know how it can be that LXMERT guesses randomly on cross-modality matching, even though it was pre-trained to output a score (after the softmax, of course) smaller than 0.5 if the caption does not describe the image, and a score bigger than 0.5 if the caption and the image match?
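For clarity, here is a minimal sketch of the interpretation I have in mind: cross_relationship_score contains two logits per image-caption pair, and a softmax over them gives a match probability. Note that the index ordering (0 = mismatch, 1 = match) and the example logit values below are assumptions of mine, not taken from the LXMERT code.

```python
import math

def softmax(logits):
    """Convert a list of raw logits to probabilities that sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cross_relationship_score logits for one image-caption pair.
# Assumed ordering: index 0 = "mismatch", index 1 = "match".
logits = [0.3, 1.2]
probs = softmax(logits)
match_prob = probs[1]

# Under this reading, the pair counts as "matching" when match_prob > 0.5,
# which for a two-way softmax is the same as logits[1] > logits[0].
print(match_prob > 0.5)
```

So if the model really guesses at chance level, the two logits must be nearly indistinguishable across COCO pairs, which is what puzzles me.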


Any visual question answering demo? Thanks.

Yes! Here it is

Hi Suraj, I am looking for a good starting point to fine-tune LXMERT for a VQA task on a custom dataset. Could you please point me to something? @valhalla