Img2seq model with pretrained weights

mtoconnor3 · August 16, 2020, 6:54pm

Hi there,

I’m very new to this transformers business, and have been playing around with the HF code to learn things as I tinker.

the context:
One thing I would like to do is build an encoder/decoder out of a CNN and a Transformer Decoder, for generating text from images. My wife likes to pretend that she’s seen movies, but when prompted for the plot she just summarizes contextual clues she gets from the movie posters. I want to see if I can get a transformer to do the same thing.

I have looked at what’s out there on the web, and most of the decoders I find for this kind of task are based on recurrent networks. Instead, I would like to adapt a pretrained transformer model to do the same thing.

the question:
Given a pre-trained CNN encoder, what would be the best way to extract the decoder from a pre-trained GPT/BERT model? I would ideally like to fine-tune something that’s already in a good place to begin with. I’m working with limited computational resources (a pair of consumer GPUs with about 12 GB of VRAM in total) and a small training dataset (a few thousand movie synopses with corresponding images).

Cheers,
Mike

johnrodriguez190380 · October 21, 2021, 7:35am

Interesting project. My suggestion would be to take the Transformer-based ViT and combine it with a decoder in a sequence-to-sequence setup, with cross-attention between the two.

nielsr · October 21, 2021, 8:05am

You can do this easily now, as I’ve recently added a generic VisionEncoderDecoder model class. It allows you to mix-and-match any vision Transformer encoder (such as ViT, DeiT, BEiT) with any text Transformer decoder (such as BERT, RoBERTa, GPT-2, etc.). A typical use case is image captioning.
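As a rough illustration of the mix-and-match idea, here is a minimal sketch that wires a ViT encoder to a BERT decoder. To keep it runnable without downloading anything, it uses tiny, randomly initialised configs; the layer sizes, image size, and vocabulary size are made-up values for the demo, not recommended settings.

```python
import torch
from transformers import (
    BertConfig,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Tiny configs with hypothetical sizes, chosen only so the sketch runs quickly.
encoder_config = ViTConfig(
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    image_size=32,
    patch_size=16,
)
decoder_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)

# Combine the two configs; the decoder is set up to attend to the encoder output.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)
model = VisionEncoderDecoderModel(config=config)
model.eval()

# Dummy batch: 2 images of shape (3, 32, 32) and 2 decoder sequences of 5 tokens.
pixel_values = torch.randn(2, 3, 32, 32)
decoder_input_ids = torch.randint(0, 100, (2, 5))

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)

# One vocabulary distribution per decoder position.
logits_shape = tuple(outputs.logits.shape)
print(logits_shape)
```

To start from pretrained weights instead, VisionEncoderDecoderModel.from_encoder_decoder_pretrained("google/vit-base-patch16-224-in21k", "bert-base-uncased") loads both checkpoints. The cross-attention layers are newly initialised in that case, so the combined model still needs fine-tuning on image/text pairs.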

johnrodriguez190380 · October 21, 2021, 3:14pm

Oh ok. Thank you for your quick reply. Let me break down what you typed to make sure I understood.

So I can use VisionEncoderDecoder with, say, a pretrained ViT vision model and a pretrained NLP model such as base DistilBERT or DistilBERT for sequence classification, and then use the combined model for multi-label classification with image and text pairs?

~john

johnrodriguez190380 · October 25, 2021, 9:09pm

Thank you for your help. I was able to get the model coded up with the following specs:

Encoder - google/vit-base-patch16-224-in21k
Decoder - bert-base-uncased.

Any suggestions for looking inside the model from an explainability standpoint? I was thinking about using something like SHAP (see the SHAP documentation), but wasn’t sure if there were other options.

prithivida · November 18, 2021, 11:57am

@johnrodriguez190380 - Hi John, I am trying to code up the same use case with the same combination, but I am having trouble preprocessing the image data and passing the pixel_values into the img2seq model. I tried passing pixel_values as input_ids after converting each image into a 3D array of shape (3, 224, 224), but that didn’t work.

Can I seek your advice?

nielsr · November 18, 2021, 1:13pm

Any vision model in the library expects pixel_values as input, which should be of shape (batch_size, num_channels, height, width).
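To make the expected shape concrete, here is a small NumPy-only sketch that turns a single height x width x channels image into a channels-first batch. The random image and the 0.5 mean/std normalisation are assumptions for the demo; the real values depend on the checkpoint's preprocessing config.

```python
import numpy as np

# Hypothetical stand-in for a decoded RGB image: (height, width, channels), uint8.
image_hwc = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Scale to [0, 1], then normalise with an assumed mean and std of 0.5.
scaled = image_hwc.astype(np.float32) / 255.0
normalized = (scaled - 0.5) / 0.5

# Move channels first: (height, width, channels) -> (channels, height, width).
image_chw = np.transpose(normalized, (2, 0, 1))

# Add the batch dimension: (batch_size, num_channels, height, width).
pixel_values = image_chw[np.newaxis, ...]
print(pixel_values.shape)
```

An array like this is passed to the model as pixel_values (after conversion to a tensor), not as input_ids, which are reserved for token ids.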

ViT (and other models like DeiT, BEiT) expect the height and width to be divisible by the patch_size of the configuration (as these models split up the input image into a sequence of non-overlapping patches, typically of size 16x16 or 32x32).
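The divisibility requirement can be checked with a few lines of plain Python. The helper name num_patches below is made up for this sketch and is not part of the library.

```python
def num_patches(height: int, width: int, patch_size: int) -> int:
    """Count the non-overlapping square patches an image is split into."""
    if height % patch_size != 0 or width % patch_size != 0:
        raise ValueError(
            f"Image size ({height}, {width}) is not divisible by patch size {patch_size}"
        )
    return (height // patch_size) * (width // patch_size)


# A 224x224 image gives a 14x14 grid of 16x16 patches, or a 7x7 grid of 32x32 patches.
patches_16 = num_patches(224, 224, 16)
patches_32 = num_patches(224, 224, 32)
print(patches_16, patches_32)
```

A 224x224 image therefore yields 196 patches at patch size 16 and 49 at patch size 32, while a size such as 225x225 would be rejected.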

You can use ViTFeatureExtractor to resize + normalize an image for such a model. Alternatively, you can use torchvision. You can check out the code example of VisionEncoderDecoderModel here.

prithivida · November 18, 2021, 1:37pm

Thanks @nielsr

I am using ViTFeatureExtractor and torchvision to resize the image, and made sure the images are a batch of 3D arrays like you pointed out.

I have added all the snippets, along with the issue, here:
https://github.com/huggingface/transformers/issues/14444
