I’m very new to this transformers business, and have been playing around with the HF code to learn things as I tinker.
the context:
One thing I would like to do is build an encoder/decoder out of a CNN and a Transformer Decoder, for generating text from images. My wife likes to pretend that she’s seen movies, but when prompted for the plot she just summarizes contextual clues she gets from the movie posters. I want to see if I can get a transformer to do the same thing.
I have looked at what’s out there on the web, and most of the decoders I find for this kind of task are based on recurrent networks. Instead, I would like to adapt a pretrained transformer model to do the same thing.
the question:
Given a pre-trained CNN encoder, what would be the best way to extract the decoder from a pre-trained GPT/BERT model? I would ideally like to fine-tune something that's already in a good place to begin with. I'm working with limited computational resources (a pair of consumer GPUs with about 12 GB of VRAM in total) and a small training dataset (a few thousand movie synopses with corresponding images).
Interesting project. My suggestion would be to take the Transformer-based ViT as the encoder and pair it with a text decoder in a sequence-to-sequence setup, connecting the two through cross-attention.
You can do this easily now, as I’ve recently added a generic VisionEncoderDecoder model class. It allows you to mix and match any vision Transformer encoder (such as ViT, DeiT, BEiT) with any text Transformer decoder (such as BERT, RoBERTa, GPT-2, etc.). A typical use case is image captioning.
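For example, here is a minimal sketch of the mix-and-match idea (the checkpoint names and token-id settings are illustrative choices, not prescriptions):

```python
from transformers import VisionEncoderDecoderModel, AutoTokenizer

# Mix and match: pre-trained ViT encoder + pre-trained GPT-2 decoder.
# The cross-attention layers in the decoder are newly initialized,
# so the combined model still needs fine-tuning on image/text pairs.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 has no dedicated pad/decoder-start tokens, so reuse its BOS/EOS ids.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.eos_token_id
```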
Oh ok. Thank you for your quick reply. Let me break down what you typed to make sure I understood.
I can use the VisionEncoderDecoder with, say, a pre-trained ViT vision model and a pre-trained NLP model such as DistilBERT (the base model or the sequence-classification variant), and then use the combined model for multi-label classification on image and text pairs?
@johnrodriguez190380 - Hi John, I am trying to code up the same use case with the same combination, but I’m having trouble preprocessing the image data and passing the pixel_values into the img2seq model. I tried passing pixel_values as input_ids after converting the images into 3D features of shape (3, 224, 224), but that didn’t work.
Any vision model in the library expects pixel_values as input, which should be of shape (batch_size, num_channels, height, width).
ViT (and other models like DeiT, BEiT) expect the height and width to be divisible by the patch_size of the configuration (as these models split up the input image into a sequence of non-overlapping patches, typically of size 16x16 or 32x32).
You can use ViTFeatureExtractor to resize + normalize an image for such a model. Alternatively, you can use torchvision. You can check out the code example of VisionEncoderDecoderModel here.
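As a rough sketch of that flow (reusing the `model` and `tokenizer` from the earlier example; the image path is a placeholder):

```python
from PIL import Image
from transformers import ViTFeatureExtractor

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")

# Load an image and let the feature extractor resize + normalize it.
image = Image.open("poster.jpg").convert("RGB")  # placeholder path
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
# pixel_values has shape (batch_size, num_channels, height, width), here (1, 3, 224, 224)

# Pass pixel_values (not input_ids) to the vision encoder-decoder model.
generated_ids = model.generate(pixel_values, max_length=32)
caption = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(caption)
```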