CLIPTextModel's get_text_features VS pooled outputs

I think the projection layer is just there to project the text features (coming from the text encoder) and the image features (coming from the vision encoder) into a shared embedding space. The vision encoder uses a different hidden dimensionality than the text encoder (e.g. 1024 vs. 768 for openai/clip-vit-large-patch14), so the projection layers on both the text and vision sides map their pooled outputs to the same projection dimensionality (e.g. 512 or 768, depending on the checkpoint). See the quick check below.
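Here's a minimal sketch (my own example, using a dummy image tensor since only the shapes matter) to check that `get_text_features` / `get_image_features` are just the encoders' pooled outputs passed through `text_projection` / `visual_projection`:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

text_inputs = tokenizer(["a photo of a cat"], return_tensors="pt", padding=True)
pixel_values = torch.randn(1, 3, 224, 224)  # dummy image, just to compare shapes

with torch.no_grad():
    # Text side: pooled output (hidden size 768) vs. projected features (projection_dim 768)
    text_pooled = model.text_model(**text_inputs).pooler_output
    text_features = model.get_text_features(**text_inputs)
    print(text_pooled.shape, text_features.shape)  # [1, 768], [1, 768]
    print(torch.allclose(model.text_projection(text_pooled), text_features, atol=1e-5))  # True

    # Vision side: pooled output (hidden size 1024) vs. projected features (projection_dim 768)
    image_pooled = model.vision_model(pixel_values=pixel_values).pooler_output
    image_features = model.get_image_features(pixel_values=pixel_values)
    print(image_pooled.shape, image_features.shape)  # [1, 1024], [1, 768]
    print(torch.allclose(model.visual_projection(image_pooled), image_features, atol=1e-5))  # True
```

So the pooled output lives in the encoder's own hidden space, while `get_text_features` / `get_image_features` return the projected embeddings that are actually compared against each other in CLIP's contrastive loss.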
