What determines CLS token value in ViTModel?


I’m using ViTModel and noticed that the sequence length after a forward pass is 5, while the number of patches is 4. I believe the extra element comes from the [CLS] token that is prepended to the patch embeddings in ViTEmbeddings, but I never specify a class for the input example, so I’m not sure where this class token’s value comes from.

last_hidden_state_vit = self.vit(x).last_hidden_state # [1, 5, 768]
print(self.vit.get_input_embeddings().num_patches) # 4
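
For context, here is my rough sketch of what I think ViTEmbeddings is doing (simplified names and sizes, not the actual Hugging Face source), which reproduces the num_patches + 1 sequence length:

```python
import torch
import torch.nn as nn

# Simplified sketch of how ViTEmbeddings builds its sequence (not the exact
# Hugging Face implementation; tiny model with 4 patches for illustration).
class TinyViTEmbeddings(nn.Module):
    def __init__(self, num_patches=4, hidden_size=768):
        super().__init__()
        # The [CLS] token appears to be a learned parameter of the model:
        # its value comes from training, not from the input image.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.position_embeddings = nn.Parameter(
            torch.zeros(1, num_patches + 1, hidden_size)
        )

    def forward(self, patch_embeddings):
        # patch_embeddings: [batch, num_patches, hidden_size]
        batch_size = patch_embeddings.shape[0]
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        # Prepend the class token, then add position embeddings.
        embeddings = torch.cat([cls_tokens, patch_embeddings], dim=1)
        return embeddings + self.position_embeddings

emb = TinyViTEmbeddings()
out = emb(torch.randn(1, 4, 768))
print(out.shape)  # torch.Size([1, 5, 768]) -- num_patches + 1
```

Is that roughly right, i.e. the [CLS] token is just a trained embedding shared across all inputs?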

Additionally, I saw in the ViTConfig the following parameters:

  • hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
  • intermediate_size (int, optional, defaults to 3072) — Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

But I’m not clear on exactly which layers these two dimensions refer to. I would really appreciate some clarification.
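
My current guess, sketched as the feed-forward part of one encoder block (simplified, not the actual Hugging Face code, so corrections welcome): hidden_size is the width of every token embedding flowing through the encoder, while intermediate_size is only the inner width of each block’s MLP sublayer.

```python
import torch
import torch.nn as nn

hidden_size = 768         # width of each token embedding throughout the encoder
intermediate_size = 3072  # inner width of the feed-forward (MLP) sublayer

# Sketch of one encoder block's feed-forward sublayer:
feed_forward = nn.Sequential(
    nn.Linear(hidden_size, intermediate_size),  # expand: 768 -> 3072
    nn.GELU(),
    nn.Linear(intermediate_size, hidden_size),  # project back: 3072 -> 768
)

tokens = torch.randn(1, 5, hidden_size)  # [CLS] + 4 patch tokens
out = feed_forward(tokens)
print(out.shape)  # torch.Size([1, 5, 768])
```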

Thanks so much,