I’m using ViTModel and noticed that the sequence length after a forward pass is 5, while the number of patches is 4. I believe the extra element comes from the [CLS] token that is concatenated to the patch embeddings in ViTEmbeddings, but since I don’t specify a class for the input example, I’m not sure how this class token gets assigned.
```python
last_hidden_state_vit = self.vit(x).last_hidden_state   # [1, 5, 768]
print(self.vit.get_input_embeddings().num_patches)      # 4
```
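To make sure I understand the shape mismatch, here is a toy sketch of what I think happens (NumPy stand-ins, not the actual ViTEmbeddings code; the sizes are just my assumption of 4 patches and a 768-dim hidden state):

```python
import numpy as np

num_patches, hidden_size = 4, 768

# Patch embeddings produced from the input image: [1, 4, 768]
patch_embeddings = np.random.randn(1, num_patches, hidden_size)

# My understanding: the [CLS] token is a *learned model parameter*,
# not something derived from the input, so no class label is needed.
cls_token = np.random.randn(1, 1, hidden_size)

# Prepending it gives a sequence of length num_patches + 1.
embeddings = np.concatenate([cls_token, patch_embeddings], axis=1)
print(embeddings.shape)  # (1, 5, 768)
```

Is that the right mental model, i.e. the token is trained rather than assigned per input?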
Additionally, I saw the following parameters in ViTConfig:

hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
intermediate_size (int, optional, defaults to 3072) — Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
But I’m not clear on which layers these two dimensions actually refer to. I would really appreciate some clarification.
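If I had to guess, the shapes flow like this toy sketch (plain NumPy stand-ins, not the real ViT modules; the comments naming ViTIntermediate/ViTOutput are my assumption from skimming the source):

```python
import numpy as np

hidden_size, intermediate_size = 768, 3072
seq_len = 5  # 4 patches + [CLS]

x = np.zeros((seq_len, hidden_size))                 # tokens entering one encoder block
W_in = np.zeros((hidden_size, intermediate_size))    # expansion (ViTIntermediate.dense?)
W_out = np.zeros((intermediate_size, hidden_size))   # projection back (ViTOutput.dense?)

h = x @ W_in     # (5, 3072): the "intermediate" feed-forward representation
out = h @ W_out  # (5, 768): back to hidden_size for the residual connection
print(h.shape, out.shape)
```

So is hidden_size the width every token keeps throughout the encoder, and intermediate_size only the temporary expansion inside each block’s MLP?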
Thanks so much,