LayoutLMv3 sequence_length vs token sequence_length size mismatch

I am trying to use LayoutLMv3Model to extract features for every word/token, but the output sequence length is different from the input's. The output shape is (batch, sequence_length, embedding_dim), and the sequence length is always 197 more than the length of the input_ids (e.g., if I set max_length=100 in the processor, my input_ids have length 100 but the model output is (batch_size, 297, 768)). I am not sure what's happening and would like to know a way to map every token/subtoken to its extracted embedding. Are these the visual embeddings?


This is because LayoutLMv3 uses both image and text modalities as input. The 197 comes from the fact that there are 196 image patches + 1 special CLS token for the visual stream (the image resolution is 224 and the patch resolution is 16, so (224/16)**2 = 196). So if you have input_ids of length 100, then the total number of tokens sent through the Transformer is 100 + 197 = 297.
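To make the arithmetic concrete, here is a minimal sketch with a dummy tensor standing in for the model output. It assumes the text tokens come first in the concatenated sequence, followed by the 197 visual tokens; the variable names are just for illustration:

```python
import numpy as np

# Assumed LayoutLMv3 defaults: 224x224 image, 16x16 patches, hidden size 768.
image_size, patch_size, hidden_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2   # 196 image patches
visual_len = num_patches + 1                    # + 1 visual CLS token = 197

text_len = 100                                  # max_length used in the processor
total_len = text_len + visual_len               # 297 tokens through the Transformer

# Dummy stand-in for outputs.last_hidden_state of shape (batch, 297, 768):
last_hidden_state = np.zeros((1, total_len, hidden_dim))

# Assuming text tokens precede the visual tokens, the text part is the
# first text_len positions:
text_embeddings = last_hidden_state[:, :text_len, :]
```

With max_length=100 this gives text_embeddings of shape (1, 100, 768), matching the input_ids, while the remaining 197 positions are the visual tokens.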


Thanks for clearing that up. I went through the code and there is a separate encoder block after the concatenation of the visual and token embeddings, so I wanted to confirm: if I simply do output_embedding[1:len(subword_tokens)+1] (assuming subword_tokens excludes CLS/PAD) to get embeddings for the subwords, do those features now also contain information from both modalities?
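The slicing described above can be sketched as follows, again with a dummy tensor in place of the real model output. The word_ids mapping shown here is hypothetical; in practice it would come from the BatchEncoding returned by the processor (via word_ids()), which maps each subword token back to its source word:

```python
import numpy as np

# Dummy stand-in for outputs.last_hidden_state: 100 text tokens + 197 visual tokens.
text_len, visual_len, hidden_dim = 100, 197, 768
last_hidden_state = np.random.rand(1, text_len + visual_len, hidden_dim)

# Skip [CLS] at index 0 and take only the real subword positions
# (assumes text tokens come first in the concatenated sequence).
n_subwords = 7  # hypothetical number of subword tokens, excluding CLS/SEP/PAD
subword_embeddings = last_hidden_state[0, 1:1 + n_subwords]

# Hypothetical word_ids for 4 words split into 7 subwords; in practice use
# the encoding's word_ids() and drop the None entries for special tokens.
word_ids = [0, 0, 1, 2, 2, 2, 3]
num_words = max(word_ids) + 1

# One common choice: average the subword embeddings belonging to each word.
word_embeddings = np.stack([
    subword_embeddings[[i for i, w in enumerate(word_ids) if w == j]].mean(axis=0)
    for j in range(num_words)
])
```

Averaging is just one pooling choice; taking the first subword of each word is another common option. Either way, since these vectors come from the encoder that runs after the text/visual concatenation, they have attended over both modalities.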