Hello, congrats to all contributors for the awesome work with LXMERT! It is exciting to see multimodal transformers coming to huggingface/transformers. Of course, I immediately tried it out and played with the demo.
Does the line `lxmert_base = LxmertForPreTraining.from_pretrained("unc-nlp/lxmert-base-uncased")` load an LXMERT model already pre-trained on the tasks enumerated in the original paper, “(1) masked cross-modality language modeling, (2) masked object prediction via RoI-feature regression, (3) masked object prediction via detected-label classification, (4) cross-modality matching, and (5) image question answering” (Tan & Bansal, 2019)?
Tagging our LXMERT specialist @lysandre
This question has been answered on GitHub here by @eltoto1219, the author of the Hugging Face implementation of LXMERT.
Hello @lysandre, thanks for tagging the right person. Here is my GitHub response, with a new question:
Hello @eltoto1219, thank you for the answer! I suppose it was a strange question on my part; I asked it to make sure that I am loading a pre-trained LXMERT model and not random weights. In particular, I look at `output_lxmert['cross_relationship_score']` for COCO images and captions (so not for out-of-distribution images and captions) after loading LXMERT with the aforementioned code, `lxmert_base = LxmertForPreTraining.from_pretrained("unc-nlp/lxmert-base-uncased")`. It seems that on cross-modality matching LXMERT performs at 50% accuracy (random guessing), so I wanted to make sure that I was loading weights pre-trained on task (4), cross-modality matching, in the first place.
New question: do you know how it can be that LXMERT guesses randomly on cross-modality matching, even though it was pre-trained to deliver a matching score (after the softmax, of course) below 0.5 if the caption does not describe the image and above 0.5 if the caption and the image match?
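For reference, this is how I turn the score into a match decision — a minimal sketch assuming `cross_relationship_score` yields a pair of logits per image-caption pair, with the ordering `[mismatch, match]` being my assumption (plain Python, no torch, just to show the arithmetic):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one image-caption pair; the ordering
# [mismatch, match] is an assumption, not confirmed by the docs.
logits = [0.1, 0.3]
p_mismatch, p_match = softmax(logits)

# The pair is predicted as "matching" when p_match > 0.5,
# i.e. whenever the match logit is the larger of the two.
is_match = p_match > 0.5
```

If the model really were pre-trained on task (4), `p_match` should be systematically above 0.5 for true COCO pairs; an accuracy of 50% means the two logits carry no signal.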
Any visual question answering demo? Thanks.
Hi Suraj, I am looking for a good starting point to fine-tune LXMERT for a VQA task on a custom dataset. Could you please point me to something? @valhalla
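While waiting for a pointer, here is a rough sketch of a single fine-tuning step with `LxmertForQuestionAnswering`. To keep it self-contained it builds a tiny randomly initialized model from a config instead of the real checkpoint; in practice you would load `LxmertForQuestionAnswering.from_pretrained("unc-nlp/lxmert-base-uncased")` and feed RoI features from a Faster R-CNN. All shapes and the label set below are placeholders:

```python
import torch
import torch.nn.functional as F
from transformers import LxmertConfig, LxmertForQuestionAnswering

# Tiny config so the sketch runs quickly; real fine-tuning would
# start from the "unc-nlp/lxmert-base-uncased" checkpoint instead.
config = LxmertConfig(
    vocab_size=100, hidden_size=64, num_attention_heads=4,
    intermediate_size=128, l_layers=1, x_layers=1, r_layers=1,
    visual_feat_dim=16, visual_pos_dim=4, num_qa_labels=10,
)
model = LxmertForQuestionAnswering(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

batch, seq_len, num_boxes = 2, 8, 4
input_ids = torch.randint(0, config.vocab_size, (batch, seq_len))
visual_feats = torch.randn(batch, num_boxes, config.visual_feat_dim)  # RoI features
visual_pos = torch.rand(batch, num_boxes, config.visual_pos_dim)      # normalized boxes
labels = torch.randint(0, config.num_qa_labels, (batch,))             # answer indices

out = model(input_ids=input_ids, visual_feats=visual_feats, visual_pos=visual_pos)
logits = out.question_answering_score  # (batch, num_qa_labels)
# Loss computed manually to avoid assumptions about the labels kwarg.
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

For a custom dataset you would mainly swap in your own answer vocabulary (`num_qa_labels`) and a dataloader producing tokenized questions plus the per-image box features and normalized coordinates.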