LayoutLMV3 embeddings

Kforcode · August 3, 2022, 6:54am

hi @neilsr @sgugger
I saw that the LayoutLMv3 output embedding has shape (709, 768), which is longer than max_position_embeddings = 512.

So I was wondering whether the remaining (709 - 512) = 197 positions are for image embeddings?
I read the paper, which says:

"... the last layer outputs text-and-image contextual representations"

So is there a way to separate the image and text embeddings, like we can in LayoutLMv2?
If you can point me to any resource explaining this, it would be great.
Thanks

nielsr · August 3, 2022, 1:05pm

Hi,

Yes, the final embeddings (hidden states) that come out of LayoutLMv3Model cover both the text and image tokens. The final hidden states (outputs.last_hidden_state) form a tensor of shape (batch_size, seq_len, hidden_size), where the sequence length equals the number of text tokens + the number of image tokens + 1 (we add 1 for the special CLS token, which is useful for classification).
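As a rough check of that arithmetic against the numbers in the question, here is a minimal sketch. It assumes a 224x224 input image cut into 16x16 patches; those two values are not stated in this thread, and the function name is just illustrative.

```python
def expected_seq_len(num_text_tokens, image_size=224, patch_size=16):
    """Sequence length = text tokens + image patch tokens + 1 special token.

    image_size and patch_size are assumed defaults, not values taken
    from this thread.
    """
    patches_per_side = image_size // patch_size   # 14 under the assumptions
    num_image_tokens = patches_per_side ** 2      # 14 * 14 = 196
    return num_text_tokens + num_image_tokens + 1


# 512 text tokens + 196 image tokens + 1 = 709, the length seen in the question
print(expected_seq_len(512))
```

Under these assumptions the 197 extra positions from the question are 196 image patch tokens plus the one special token.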

You can separate them by checking the length of the input_ids you sent through the model, like so:

text_seq_length = input_ids.shape[1]

# only take the text part of the output representations
sequence_output = outputs.last_hidden_state[:, :text_seq_length]
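If you also want the non-text part, the same slicing idea works from the other side. Below is a small sketch that uses a plain Python list as a stand-in for one document's hidden states, so it runs without loading a model; the helper name and the tiny hidden size of 4 are made up for illustration. With a real tensor the equivalent slices would be [:, :text_seq_length] and [:, text_seq_length:].

```python
def split_text_and_rest(hidden_states, text_seq_length):
    """Split per-token vectors into the text part and everything after it."""
    text_part = hidden_states[:text_seq_length]
    rest_part = hidden_states[text_seq_length:]  # image-side positions
    return text_part, rest_part


# Stand-in for one sequence of 709 token vectors (hidden size 4 for brevity)
hidden_states = [[float(i)] * 4 for i in range(709)]

text_part, rest_part = split_text_and_rest(hidden_states, text_seq_length=512)
print(len(text_part), len(rest_part))
```

For the shapes in the question this gives 512 text positions and 197 remaining positions.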

Kforcode · August 3, 2022, 1:28pm

Thanks a lot, @nielsr. Following up on the same:
is the [CLS] token at the end here?
The last text token should be [SEP], right? So it should be
[CLS] [512 * text tokens] [SEP] [num_patches * image tokens], right?
Can you please clarify the order of these tokens?

nielsr · August 3, 2022, 1:57pm

No, there’s no SEP token used; it’s just:

[CLS], text tokens, image tokens

as can be seen here, where the text embeddings and visual embeddings are concatenated.

The only time you’re using a [SEP] token is when doing question answering; then you’ll have:

[CLS], question tokens, [SEP], context tokens, image tokens

Kforcode · August 3, 2022, 2:05pm

cool thanks
