CLIPTextModel's get_text_features VS pooled outputs

I think the projection layer is just there to project the text features (coming from the text encoder) and the image features (coming from the vision encoder) into a shared embedding space. The vision encoder uses a different hidden dimensionality than the text encoder (e.g. 1024 vs. 768 for openai/clip-vit-large-patch14), so the projection layers on both the text and vision sides map their pooled outputs to the same projection dimensionality (e.g. 512 or 768, depending on the checkpoint). See the quick check below.
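Here's a minimal sketch (my own example, using a dummy image tensor since only the shapes matter) to check that `get_text_features` / `get_image_features` are just the encoders' pooled outputs passed through `text_projection` / `visual_projection`:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

text_inputs = tokenizer(["a photo of a cat"], return_tensors="pt", padding=True)
pixel_values = torch.randn(1, 3, 224, 224)  # dummy image, just to compare shapes

with torch.no_grad():
    # Text side: pooled output (hidden size 768) vs. projected features (projection_dim 768)
    text_pooled = model.text_model(**text_inputs).pooler_output
    text_features = model.get_text_features(**text_inputs)
    print(text_pooled.shape, text_features.shape)  # [1, 768], [1, 768]
    print(torch.allclose(model.text_projection(text_pooled), text_features, atol=1e-5))  # True

    # Vision side: pooled output (hidden size 1024) vs. projected features (projection_dim 768)
    image_pooled = model.vision_model(pixel_values=pixel_values).pooler_output
    image_features = model.get_image_features(pixel_values=pixel_values)
    print(image_pooled.shape, image_features.shape)  # [1, 1024], [1, 768]
    print(torch.allclose(model.visual_projection(image_pooled), image_features, atol=1e-5))  # True
```

So the pooled output lives in the encoder's own hidden space, while `get_text_features` / `get_image_features` return the projected embeddings that are actually compared against each other in CLIP's contrastive loss.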
