Combine word embeddings with visual features from a ViT model

For example, I have a text embedding tensor W from nn.Embedding with size (1, 7, 768) [batch_size x seq_len x emb_size].
Using a pretrained ViT model on the input image, I get the last_hidden_state L with size (1, 196, 768) (without the CLS token) and the pooler output P with size (1, 768). How can I combine the word embedding tensor with the output of the ViT model? Is it correct to use torch.cat([W, L], dim=1) to get a joint tensor of size (1, 203, 768) representing both text and image?

Yes, that’s indeed the correct way: concatenate them along the sequence (also called time) dimension.
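
To make the shapes concrete, here is a minimal sketch of that concatenation. It uses a random tensor as a stand-in for the ViT output rather than a real pretrained model, and the vocabulary size, variable names, and all-ones attention mask are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen to match the shapes in the question.
vocab_size, seq_len, emb_size = 1000, 7, 768

# Text side: nn.Embedding maps token ids to vectors.
embedding = nn.Embedding(vocab_size, emb_size)
token_ids = torch.randint(0, vocab_size, (1, seq_len))
W = embedding(token_ids)                      # (1, 7, 768)

# Image side: random stand-in for a ViT last_hidden_state that
# still contains the CLS token at position 0 (1 + 196 patches).
vit_hidden = torch.randn(1, 197, emb_size)
L = vit_hidden[:, 1:, :]                      # drop CLS -> (1, 196, 768)

# Concatenate along the sequence dimension (dim=1).
fused = torch.cat([W, L], dim=1)              # (1, 203, 768)

# A downstream transformer would usually also need a mask covering
# both parts; here every position is treated as valid (assumption).
attention_mask = torch.ones(fused.shape[:2], dtype=torch.long)

print(fused.shape, attention_mask.shape)
```

Concatenating this way only works because both tensors share the same batch size and hidden size (768). If the text and image encoders had different hidden sizes, one of them would first need a projection layer.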

Thank you sir.
I also wonder in which scenarios pooler_output would be preferred over the last_hidden_state tensor?

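The pooler_output can be seen as a “summary” of an entire sequence of tokens, so it is useful when you need a single fixed-size vector per input (for example, for classification) rather than one vector per token. There are typically two ways to obtain a summary of all tokens: either use the pooler_output, or average the final hidden states of all tokens.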
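
The two summaries can be sketched as follows. This is an illustration on a random stand-in tensor, not the library's actual code: it assumes the pooler is a linear layer followed by tanh applied to the first (CLS) token, as in BERT-style poolers, and the `pooler_dense` layer here is hypothetical and untrained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random stand-in for a ViT last_hidden_state: CLS token + 196 patches.
last_hidden_state = torch.randn(1, 197, 768)

# Option 1: pooler-style summary (assumed form: linear + tanh on the
# CLS token). This layer is a hypothetical, untrained placeholder.
pooler_dense = nn.Linear(768, 768)
cls_token = last_hidden_state[:, 0]                    # (1, 768)
pooled_cls = torch.tanh(pooler_dense(cls_token))       # (1, 768)

# Option 2: mean pooling over the final hidden states of all tokens.
pooled_mean = last_hidden_state.mean(dim=1)            # (1, 768)

print(pooled_cls.shape, pooled_mean.shape)
```

Both options reduce the (1, 197, 768) sequence to a single (1, 768) vector, which can then be fed to a classifier head or compared with other summary vectors.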