Hi @nielsr @sgugger
I saw that the LayoutLMv3 output embedding has shape (709, 768), which is longer than max_position_embeddings = 512.
So I was wondering whether the remaining (709 - 512) = 197 positions are the image embeddings?
I read the paper, and it says the last layer outputs text-and-image
So is there a way to separate the image and text embeddings, like we can in LayoutLMv2?
If you can point me to any resource explaining this, it would be great.
Yes, the final embeddings (hidden states) that come out of LayoutLMv3Model cover both the text and image tokens. The final hidden states (outputs.last_hidden_state) are a tensor of shape (batch_size, seq_len, hidden_size), where the sequence length equals the number of text tokens + the number of image tokens + 1 (we add 1 for the special CLS token, which is useful for classification).
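As a quick sanity check, the arithmetic above can be matched against the 709 from the question. This is only a sketch: the 224x224 image size and 16x16 patch size are assumptions (they are not stated in this thread), and the text length of 512 assumes input_ids were padded to max_position_embeddings.

```python
# Assumed values, not stated in the thread: 224x224 input image, 16x16 patches.
image_size = 224
patch_size = 16

# Assumed: input_ids padded/truncated to max_position_embeddings.
num_text_tokens = 512

# One image token per patch: (224 // 16) ** 2 = 14 * 14 = 196
num_image_tokens = (image_size // patch_size) ** 2

# text tokens + image tokens + 1 for the extra CLS token
seq_len = num_text_tokens + num_image_tokens + 1
print(seq_len)  # 709
```

Under these assumptions, the 197 extra positions would be 196 patch tokens plus the one CLS token.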
You can separate them by checking the length of the input_ids you sent through the model, like so:
text_seq_length = input_ids.shape[1]
# only take the text part of the output representations
sequence_output = outputs.last_hidden_state[:, :text_seq_length]
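Here is a minimal, self-contained sketch of that split, which also keeps the image part. It uses dummy NumPy arrays with the shapes from the question instead of a real model call, so the variable names text_output and image_output are just illustrative; the same slicing should work on the PyTorch tensors returned by the model.

```python
import numpy as np

# Shapes taken from the question: 512 text positions, 709 total, hidden size 768.
batch_size, text_len, seq_len, hidden_size = 1, 512, 709, 768

# Dummy stand-ins for the real input_ids and outputs.last_hidden_state.
input_ids = np.zeros((batch_size, text_len), dtype=np.int64)
last_hidden_state = np.random.rand(batch_size, seq_len, hidden_size)

# The text length is the second dimension of input_ids.
text_seq_length = input_ids.shape[1]

# Text part comes first, everything after it is the image part.
text_output = last_hidden_state[:, :text_seq_length]
image_output = last_hidden_state[:, text_seq_length:]

print(text_output.shape)   # (1, 512, 768)
print(image_output.shape)  # (1, 197, 768)
```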
thanks a lot, @nielsr following the same
Is the [CLS] token at the end here?
The last token should be [SEP], right? So it should be
[CLS] [512 * text tokens] [SEP] [num_patches * image tokens], right?
Can you please clarify the order of these tokens?
No, there’s no SEP token used; it’s just:
[CLS], text tokens, image tokens
as can be seen here - where the text embeddings and visual embeddings are concatenated.
The only time you’re using a [SEP] token is when doing question answering, then you’ll have
[CLS], question tokens, [SEP], context tokens, image tokens