Combine word embedding with visual features from ViT model

daeron · November 10, 2022, 7:22am

For example, I have a text nn.Embedding tensor W with size (1, 7, 768) [batch_size x seq_len x emb_size].
Using a pretrained ViT model on the input image, I get the last_hidden_state L with size (1, 196, 768) (not using the CLS token) and the pooler output P with size (1, 768). How can I combine the word embedding tensor with the output of the ViT model? Is it correct to use torch.cat([W, L], dim=1) to get a representative tensor of size (1, 203, 768) for text and image?

nielsr · November 10, 2022, 9:00am

Yes, that’s indeed the correct way: concatenate them along the sequence (also called time) dimension.
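
A minimal sketch of that concatenation, using the shapes from the question. The vocabulary size and the random tensor standing in for the ViT patch states are assumptions made only so the snippet runs without downloading a model; with a real ViT output you would use its last_hidden_state with the CLS position sliced off.

```python
import torch
import torch.nn as nn

# Assumed sizes: 7 text tokens and 196 image patches, both with hidden size 768.
vocab_size = 30522  # hypothetical vocabulary size, any value works here
embedding = nn.Embedding(vocab_size, 768)

token_ids = torch.randint(0, vocab_size, (1, 7))
W = embedding(token_ids)  # text embeddings, shape (1, 7, 768)

# Stand-in for the ViT patch states with the CLS token already removed.
L = torch.randn(1, 196, 768)

# dim=1 is the sequence dimension; batch and hidden sizes must already match.
combined = torch.cat([W, L], dim=1)
print(combined.shape)  # torch.Size([1, 203, 768])
```

Concatenating along dim=1 only works because both tensors share the same hidden size (768); if they did not, one of them would first need a linear projection.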

daeron · November 10, 2022, 4:46pm

Thank you sir.
I also wonder in which kind of scenario pooler_output would be used rather than the last_hidden_state tensor?

nielsr · November 11, 2022, 10:01am

The pooler_output can be seen as a “summary” of an entire sequence of tokens. There are typically two ways to obtain a summary of all tokens: use the pooler_output, or average the final hidden states of all tokens.
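
A rough sketch of the two kinds of summary, again with a random tensor as an assumed stand-in for last_hidden_state. The CLS slice below is only an approximation of pooler_output, which the model computes itself from the CLS state; the averaging line is the second option.

```python
import torch

# Assumed stand-in for last_hidden_state: 1 CLS token + 196 patches, hidden size 768.
last_hidden_state = torch.randn(1, 197, 768)

# Option 1 (approximation): take the state at the CLS position as the summary.
cls_summary = last_hidden_state[:, 0, :]  # shape (1, 768)

# Option 2: average the final hidden states over the sequence dimension.
mean_summary = last_hidden_state.mean(dim=1)  # shape (1, 768)

print(cls_summary.shape, mean_summary.shape)
```

Both give one fixed-size vector per image, which suits tasks such as classification or retrieval, whereas the full last_hidden_state keeps per-patch detail for concatenation with text tokens.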
