Image captioning decoder

Excuse me, does the decoder of the language model deal with words or sentences when doing the captioning? I mean, if I use BERT, will I use it for word embeddings or sentence embeddings? Thanks in advance

AFAIK you can use GPT-2 as the decoder. Here’s an example with ViT + GPT-2.

Thanks for the link, but I need to use BERT, and I couldn’t find it in the link.

You can also initialize the weights of the decoder with the weights of any encoder-only model, like BERT. This works because a decoder is also just a stack of blocks (self-attention + feed-forward neural networks), similar to an encoder. The only difference is that a decoder adds cross-attention layers. Hence, if you initialize the weights of a decoder with the weights of an encoder-only model, the weights of the cross-attention layers will be randomly initialized and will need to be fine-tuned on a downstream task (like summarization, machine translation, or image captioning).
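To make the idea concrete, here is a toy, dependency-free sketch of that initialization scheme (in Transformers itself this is roughly what `VisionEncoderDecoderModel.from_encoder_decoder_pretrained` does for you). The layer names `self_attn`, `ffn`, and `cross_attn`, and the weights-as-lists representation, are made up for illustration:

```python
import random

random.seed(0)

def encoder_checkpoint(num_layers=2):
    """Pretend encoder-only checkpoint: each layer has self-attention
    and feed-forward weights (here just short lists of floats)."""
    return {
        f"layer{i}.{part}": [0.1 * i] * 4
        for i in range(num_layers)
        for part in ("self_attn", "ffn")
    }

def init_decoder_from_encoder(encoder_weights, num_layers=2):
    decoder = {}
    for i in range(num_layers):
        # Self-attention and feed-forward weights exist in the
        # encoder checkpoint, so they are copied over directly...
        decoder[f"layer{i}.self_attn"] = encoder_weights[f"layer{i}.self_attn"]
        decoder[f"layer{i}.ffn"] = encoder_weights[f"layer{i}.ffn"]
        # ...but cross-attention has no counterpart in an encoder-only
        # model, so it starts randomly initialized and must be learned
        # during fine-tuning on the downstream task.
        decoder[f"layer{i}.cross_attn"] = [random.gauss(0, 0.02) for _ in range(4)]
    return decoder

enc = encoder_checkpoint()
dec = init_decoder_from_encoder(enc)
```

After this step, `dec` has three weight groups per layer: two inherited from the encoder checkpoint and one (cross-attention) that is fresh, which is exactly why fine-tuning is required.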

Also, to answer your question: models like BERT/RoBERTa/GPT-x all operate on subword tokens, rather than words. This means that a word like “cookie” might be tokenized into “coo” and “kie”. These models learn an embedding vector for each individual subword token.
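Subword tokenization can be illustrated with a tiny WordPiece-style sketch. Note the vocabulary below is a toy one invented for the example (BERT’s real vocabulary has ~30k entries, and “cookie” may or may not be split there), and the greedy longest-match rule is a simplification of the real tokenizer:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece-style split of a single word.
    Continuation pieces are prefixed with '##', as in BERT's tokenizer."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mark as a word-internal piece
            if sub in vocab:
                match = sub
                break
            end -= 1  # shrink the candidate and try again
        if match is None:
            return ["[UNK]"]  # no piece matches: unknown token
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, not BERT's real one.
vocab = {"coo", "##kie", "play", "##ing"}
print(wordpiece("cookie", vocab))   # ['coo', '##kie']
print(wordpiece("playing", vocab))  # ['play', '##ing']
```

Each of those pieces (`coo`, `##kie`, …) then gets its own learned embedding vector; there is no single “word embedding” for the whole word.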


Thanks a lot for replying. Excuse me, if I extract the features from images using a CNN, can I extract embeddings for the captions using BERT, use them to initialize the weights of the embedding layer, and then use an LSTM, for example?
And will I need to extract features for words using BERT from all layers, or should I ignore the classification layer?