How do we use LXMERT for inference?

I see that we can download the LXMERT pretrained model.

Great - so how do we actually load images and start asking questions? 🙂

I was looking into LXMERT too and wondering whether it might be used for image captioning. Has anyone tried something like that? Are there tutorials on how to use it for inference?

Did you manage to use this model for captioning?
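
On the original question of loading images and asking questions: as far as I understand, LXMERT does not take raw pixels. It expects per-region features from an object detector (2048 values per box, typically from a Faster R-CNN) together with the box coordinates scaled to the 0-1 range. Below is a minimal sketch of the VQA path, assuming the Hugging Face `transformers` port and the `unc-nlp/lxmert-base-uncased` / `unc-nlp/lxmert-vqa-uncased` checkpoints. The helper names `normalize_boxes` and `vqa_answer_index` are made up for illustration, and the detector step that produces `visual_feats` and `boxes` is not shown.

```python
from typing import List, Sequence


def normalize_boxes(
    boxes: Sequence[Sequence[float]], img_w: float, img_h: float
) -> List[List[float]]:
    """Scale (x1, y1, x2, y2) pixel boxes into the 0-1 range."""
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image width and height must be positive")
    return [
        [float(x1) / img_w, float(y1) / img_h, float(x2) / img_w, float(y2) / img_h]
        for x1, y1, x2, y2 in boxes
    ]


def vqa_answer_index(question, visual_feats, boxes, img_w, img_h) -> int:
    """Return the index of the highest-scoring answer for one image.

    visual_feats: detector features, shape (num_boxes, 2048).
    boxes: matching pixel boxes, shape (num_boxes, 4).
    """
    # Imported lazily so normalize_boxes works without torch/transformers.
    import torch
    from transformers import LxmertForQuestionAnswering, LxmertTokenizer

    # Checkpoint names are assumptions; swap in whichever ones you downloaded.
    tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
    model = LxmertForQuestionAnswering.from_pretrained("unc-nlp/lxmert-vqa-uncased")
    model.eval()

    inputs = tokenizer(question, return_tensors="pt")
    # Add a batch dimension of 1 to both visual inputs.
    feats = torch.as_tensor(visual_feats, dtype=torch.float32).unsqueeze(0)
    pos = torch.tensor(
        normalize_boxes(boxes, img_w, img_h), dtype=torch.float32
    ).unsqueeze(0)

    with torch.no_grad():
        out = model(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            token_type_ids=inputs.token_type_ids,
            visual_feats=feats,
            visual_pos=pos,
        )
    # The index still has to be mapped to text via the VQA answer vocabulary.
    return int(out.question_answering_score.argmax(dim=-1))
```

The returned value is only an index into the answer list the VQA head was trained on, so you would still need that label file to turn it into a readable answer.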