Img2seq model with pretrained weights

Hi there,

I’m very new to this transformers business, and have been playing around with the HF code to learn things as I tinker.

the context:
One thing I would like to do is build an encoder/decoder out of a CNN and a Transformer Decoder, for generating text from images. My wife likes to pretend that she’s seen movies, but when prompted for the plot she just summarizes contextual clues she gets from the movie posters. I want to see if I can get a transformer to do the same thing.

I have looked at what’s out there on the web, and most of the decoders I find for this kind of task are based on recurrent networks. Instead, I would like to adapt a pretrained transformer model to do the same thing.

the question:
Given a pre-trained CNN encoder, what would be the best way to extract the decoder from a pre-trained GPT/BERT model? I would ideally like to fine-tune something that’s already in a good place to begin with. I’m working with limited computational resources (a pair of consumer GPUs with about 12 GB of VRAM in total) and a small training dataset (a few thousand movie synopses with corresponding images).



Interesting project. My suggestion would be to take the Transformer-based ViT and combine it with a text decoder as a sequence-to-sequence model, with cross-attention connecting the two.

You can do this easily now, as I’ve recently added a generic VisionEncoderDecoderModel class. It allows you to mix and match any vision Transformer encoder (such as ViT, DeiT, BEiT) with any text Transformer decoder (such as BERT, RoBERTa, GPT-2, etc.). A typical use case is image captioning.
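As a rough sketch of how the encoder and decoder fit together, the snippet below builds a VisionEncoderDecoderModel from a ViT config and a BERT config and runs one forward pass. The tiny layer sizes, the 32x32 image size and the vocabulary of 100 tokens are assumptions chosen only so that it runs quickly without downloading anything; they are not values from this thread. For pretrained weights, the commented-out call at the end shows the `from_encoder_decoder_pretrained` route, using the checkpoint names mentioned later in this thread.

```python
import torch
from transformers import (
    BertConfig,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Tiny, randomly initialised configs (illustrative assumption, not real checkpoints).
encoder_config = ViTConfig(
    image_size=32,
    patch_size=16,
    num_channels=3,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
decoder_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)

# This helper marks the decoder config as a decoder with cross-attention.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)
model = VisionEncoderDecoderModel(config=config)
model.eval()

# The encoder takes images as pixel_values: (batch_size, num_channels, height, width).
pixel_values = torch.randn(2, 3, 32, 32)
# The decoder takes token ids: (batch_size, sequence_length).
decoder_input_ids = torch.randint(0, 100, (2, 5))

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)

# One score per vocabulary entry for every decoder position.
logits_shape = tuple(outputs.logits.shape)
print(logits_shape)

# With pretrained checkpoints (this downloads weights), the equivalent would be:
# model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
#     "google/vit-base-patch16-224-in21k", "bert-base-uncased"
# )
```

Note that the image goes in as pixel_values and the text as decoder_input_ids (or labels during training). When training with labels, config.decoder_start_token_id and config.pad_token_id typically need to be set first so the model can shift the labels to build the decoder inputs.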


Oh ok. Thank you for your quick reply. Let me break down what you typed to make sure I understood.

I can use VisionEncoderDecoderModel with, say, a pretrained ViT vision model and a pretrained DistilBERT base model (or DistilBERT for sequence classification), and then use the combined model for multi-label classification with image and text pairs?


Thank you for your help. I was able to get the model coded up with the following specs:

Encoder - google/vit-base-patch16-224-in21k
Decoder - bert-base-uncased.

Any suggestions for looking inside the model from an explainability standpoint? I was thinking about using something like SHAP (see the SHAP documentation), but I wasn’t sure if there were other options.


@johnrodriguez190380 - Hi John, I am trying to code up the same use case with the same combination, but I am having trouble preprocessing the image data and passing the pixel_values into the img2seq model. I tried passing pixel_values as input_ids after converting each image into a 3D array of shape (3, 224, 224), but that didn’t work.

Can I seek your advice?

Any vision model in the library expects pixel_values as input, which should be of shape (batch_size, num_channels, height, width).
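To make the expected shape concrete, here is a minimal NumPy sketch that turns a single image into a batch of one. The helper name `to_pixel_values` is hypothetical, the random array stands in for an image that has already been resized to 224x224, and the mean and standard deviation of 0.5 are an assumption; in practice these values come from the preprocessing configuration of the checkpoint you use.

```python
import numpy as np


def to_pixel_values(image_hwc, mean=0.5, std=0.5):
    """Convert one (height, width, channels) uint8 image to a (1, C, H, W) float32 batch.

    `mean` and `std` are assumed values; use the ones that match your checkpoint.
    """
    if image_hwc.ndim != 3:
        raise ValueError(f"expected a 3D image array, got {image_hwc.ndim} dimensions")
    scaled = image_hwc.astype(np.float32) / 255.0   # rescale to [0, 1]
    normalized = (scaled - mean) / std              # normalize each pixel
    chw = np.transpose(normalized, (2, 0, 1))       # channels first: (C, H, W)
    return chw[np.newaxis, ...]                     # add the batch dimension


# A random stand-in for an RGB image already resized to 224x224.
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
pixel_values = to_pixel_values(image)
print(pixel_values.shape, pixel_values.dtype)
```

A single (3, 224, 224) array is missing the leading batch dimension, and it should be passed under the name pixel_values rather than input_ids.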

ViT (and other models like DeiT, BEiT) expect the height and width to be divisible by the patch_size of the configuration (as these models split up the input image into a sequence of non-overlapping patches, typically of size 16x16 or 32x32).
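The arithmetic behind the patch split can be sketched in a few lines of plain Python. The function name `num_patches` is made up for illustration; it only mirrors the divisibility rule described above and is not part of the library.

```python
def num_patches(height, width, patch_size):
    """Count the non-overlapping square patches an image is split into."""
    if height % patch_size != 0 or width % patch_size != 0:
        raise ValueError(
            f"image size {height}x{width} is not divisible by patch size {patch_size}"
        )
    return (height // patch_size) * (width // patch_size)


patches_16 = num_patches(224, 224, 16)  # 14 * 14 patches
patches_32 = num_patches(224, 224, 32)  # 7 * 7 patches
print(patches_16, patches_32)
```

For a 224x224 image this gives 196 patches at patch size 16 and 49 at patch size 32. ViT additionally prepends a [CLS] token, so the encoder sequence is one element longer than the patch count.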

You can use ViTFeatureExtractor to resize + normalize an image for such a model. Alternatively, you can use torchvision. You can check out the code example of VisionEncoderDecoderModel here.

Thanks @nielsr

I am using ViTFeatureExtractor and torchvision to resize the images, and I made sure each batch consists of 3D arrays, as you pointed out.

I have added all the snippets here, along with the issue.