Hi,
I’m using ViTModel and noticed that the sequence length after a forward pass is 5, while the number of patches is 4. I believe the extra element comes from the [CLS] token that is concatenated to the patch embeddings in ViTEmbeddings, but I never specify a class for the input example, so I’m not sure how this class token gets assigned.
last_hidden_state_vit = self.vit(x).last_hidden_state # [1, 5, 768]
print(self.vit.get_input_embeddings().num_patches) # 4
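For reference, here is my (possibly simplified) understanding of what ViTEmbeddings does: a single learnable [CLS] embedding, which is a model parameter rather than anything derived from the input, is prepended to the patch embeddings, so the sequence length becomes num_patches + 1. A minimal sketch of that assumption:

```python
import torch

# Assumed simplification of ViTEmbeddings: a learned [CLS] vector
# (a Parameter, not computed from the input image) is prepended
# to the patch embeddings.
batch_size, num_patches, hidden_size = 1, 4, 768

patch_embeddings = torch.randn(batch_size, num_patches, hidden_size)
cls_token = torch.nn.Parameter(torch.zeros(1, 1, hidden_size))  # learned during training

cls_tokens = cls_token.expand(batch_size, -1, -1)  # one copy per batch item
embeddings = torch.cat((cls_tokens, patch_embeddings), dim=1)
print(embeddings.shape)  # torch.Size([1, 5, 768]) -> seq_len = num_patches + 1
```

If that is right, it would explain why no class needs to be supplied: the token starts as a generic learned vector and only becomes a summary of the image through attention.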
Additionally, I saw in the ViTConfig the following parameters:
- hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
- intermediate_size (int, optional, defaults to 3072) — Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
But I’m not clear on which layers of the model these two dimensions refer to. I would really appreciate some clarification.
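My current guess (please correct me if this is wrong) is that hidden_size is the width of the token embeddings flowing through every encoder layer, and intermediate_size is only the inner width of each layer’s feed-forward block, something like:

```python
import torch
import torch.nn as nn

# My assumption of the per-layer feed-forward block in the ViT encoder:
# tokens of width hidden_size are expanded to intermediate_size and
# projected back, so the sequence shape is unchanged.
hidden_size, intermediate_size = 768, 3072

feed_forward = nn.Sequential(
    nn.Linear(hidden_size, intermediate_size),  # 768 -> 3072
    nn.GELU(),
    nn.Linear(intermediate_size, hidden_size),  # 3072 -> 768
)

x = torch.randn(1, 5, hidden_size)  # [batch, seq_len, hidden_size]
print(feed_forward(x).shape)  # torch.Size([1, 5, 768])
```

Is that the right reading, or does intermediate_size affect other layers too?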
Thanks so much,
Eric