What determines CLS token value in ViTModel?


I’m using ViTModel and noticed that the sequence length after a forward pass is 5, while the number of patches is 4. I believe the extra element comes from the [CLS] token that is prepended to the patch embeddings in ViTEmbeddings, but I never specify a class for the input example, so I’m not sure where this class token’s value comes from.

last_hidden_state_vit = self.vit(x).last_hidden_state # [1, 5, 768]
print(self.vit.get_input_embeddings().num_patches) # 4
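
For context, here is my rough sketch of what I think ViTEmbeddings is doing (simplified names and sizes, not the actual Hugging Face source), which reproduces the num_patches + 1 sequence length:

```python
import torch
import torch.nn as nn

# Simplified sketch of how ViTEmbeddings builds its sequence (not the exact
# Hugging Face implementation; tiny model with 4 patches for illustration).
class TinyViTEmbeddings(nn.Module):
    def __init__(self, num_patches=4, hidden_size=768):
        super().__init__()
        # The [CLS] token appears to be a learned parameter of the model:
        # its value comes from training, not from the input image.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.position_embeddings = nn.Parameter(
            torch.zeros(1, num_patches + 1, hidden_size)
        )

    def forward(self, patch_embeddings):
        # patch_embeddings: [batch, num_patches, hidden_size]
        batch_size = patch_embeddings.shape[0]
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        # Prepend the class token, then add position embeddings.
        embeddings = torch.cat([cls_tokens, patch_embeddings], dim=1)
        return embeddings + self.position_embeddings

emb = TinyViTEmbeddings()
out = emb(torch.randn(1, 4, 768))
print(out.shape)  # torch.Size([1, 5, 768]) -- num_patches + 1
```

Is that roughly right, i.e. the [CLS] token is just a trained embedding shared across all inputs?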

Additionally, I saw in the ViTConfig the following parameters:

  • hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
  • intermediate_size (int, optional, defaults to 3072) — Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.

But I’m not clear on exactly which layers these two dimensions refer to. I would really appreciate some clarification.
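
My current guess, sketched as the feed-forward part of one encoder block (simplified, not the actual Hugging Face code, so corrections welcome): hidden_size is the width of every token embedding flowing through the encoder, while intermediate_size is only the inner width of each block’s MLP sublayer.

```python
import torch
import torch.nn as nn

hidden_size = 768         # width of each token embedding throughout the encoder
intermediate_size = 3072  # inner width of the feed-forward (MLP) sublayer

# Sketch of one encoder block's feed-forward sublayer:
feed_forward = nn.Sequential(
    nn.Linear(hidden_size, intermediate_size),  # expand: 768 -> 3072
    nn.GELU(),
    nn.Linear(intermediate_size, hidden_size),  # project back: 3072 -> 768
)

tokens = torch.randn(1, 5, hidden_size)  # [CLS] + 4 patch tokens
out = feed_forward(tokens)
print(out.shape)  # torch.Size([1, 5, 768])
```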

Thanks so much,