Image captioning decoder

Excuse me, does the decoder of the language model deal with words or sentences when doing the captioning? I mean, if I use BERT, will I use it for word embeddings or sentence embeddings? Thanks in advance

AFAIK you can use GPT-2 as the decoder. Here’s an example with ViT + GPT-2.

Thanks for the link, but I need to use BERT, and I couldn’t find it in the link.

You can also initialize the weights of the decoder with the weights of any encoder-only model, like BERT. This works because a decoder is also just a stack of blocks (self-attention + feed-forward neural networks), similar to an encoder. The only difference is that a decoder adds cross-attention layers. Hence, if you initialize the weights of a decoder with the weights of an encoder-only model, the weights of the cross-attention layers will be randomly initialized and will need to be fine-tuned on a downstream task (like summarization, machine translation, or image captioning).
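To make the idea concrete, here is a toy, dependency-free sketch of that initialization scheme (in Transformers itself this is roughly what `VisionEncoderDecoderModel.from_encoder_decoder_pretrained` does for you). The layer names `self_attn`, `ffn`, and `cross_attn`, and the weights-as-lists representation, are made up for illustration:

```python
import random

random.seed(0)

def encoder_checkpoint(num_layers=2):
    """Pretend encoder-only checkpoint: each layer has self-attention
    and feed-forward weights (here just short lists of floats)."""
    return {
        f"layer{i}.{part}": [0.1 * i] * 4
        for i in range(num_layers)
        for part in ("self_attn", "ffn")
    }

def init_decoder_from_encoder(encoder_weights, num_layers=2):
    decoder = {}
    for i in range(num_layers):
        # Self-attention and feed-forward weights exist in the
        # encoder checkpoint, so they are copied over directly...
        decoder[f"layer{i}.self_attn"] = encoder_weights[f"layer{i}.self_attn"]
        decoder[f"layer{i}.ffn"] = encoder_weights[f"layer{i}.ffn"]
        # ...but cross-attention has no counterpart in an encoder-only
        # model, so it starts randomly initialized and must be learned
        # during fine-tuning on the downstream task.
        decoder[f"layer{i}.cross_attn"] = [random.gauss(0, 0.02) for _ in range(4)]
    return decoder

enc = encoder_checkpoint()
dec = init_decoder_from_encoder(enc)
```

After this step, `dec` has three weight groups per layer: two inherited from the encoder checkpoint and one (cross-attention) that is fresh, which is exactly why fine-tuning is required.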

Also, to answer your question: models like BERT/RoBERTa/GPT-x all operate on subword tokens, rather than words. This means that a word like “cookie” might be tokenized into “coo” and “kie”. These models learn an embedding vector for each individual subword token.
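Subword tokenization can be illustrated with a tiny WordPiece-style sketch. Note the vocabulary below is a toy one invented for the example (BERT’s real vocabulary has ~30k entries, and “cookie” may or may not be split there), and the greedy longest-match rule is a simplification of the real tokenizer:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece-style split of a single word.
    Continuation pieces are prefixed with '##', as in BERT's tokenizer."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # mark as a word-internal piece
            if sub in vocab:
                match = sub
                break
            end -= 1  # shrink the candidate and try again
        if match is None:
            return ["[UNK]"]  # no piece matches: unknown token
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, not BERT's real one.
vocab = {"coo", "##kie", "play", "##ing"}
print(wordpiece("cookie", vocab))   # ['coo', '##kie']
print(wordpiece("playing", vocab))  # ['play', '##ing']
```

Each of those pieces (`coo`, `##kie`, …) then gets its own learned embedding vector; there is no single “word embedding” for the whole word.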


Thanks a lot for replying. Excuse me, if I extract the features from images using a CNN, can I extract embeddings for the captions using BERT, use them to initialize the weights of the embedding layer, and then use an LSTM, for example?
And will I need to extract features for words using BERT from all layers, or should I ignore the classification layer?