Hey!
I am trying to get the image features/embeddings from BlipForImageTextRetrieval.
According to the docs:
- image_embeds (torch.FloatTensor of shape (batch_size, output_dim), optional, returned when model is initialized with with_projection=True): The image embeddings obtained by applying the projection layer to the pooler_output
How can I set with_projection to True? I cannot find any information about this anywhere.
Is there any other way to extract the features?
How would I go about extracting the text features (for a future use case)?
Thank you!
Edit: This is my current workaround:

    from PIL import Image

    def get_vision_features(img_path, model, processor):
        # Run only the vision tower, then apply the model's vision projection layer
        vision_model = model.vision_model
        img = Image.open(img_path).convert("RGB")  # make sure the image has 3 channels
        processed_images = processor(images=img, return_tensors="pt")
        vision_embeddings = vision_model(pixel_values=processed_images.pixel_values, return_dict=True).pooler_output
        vision_features = model.vision_proj(vision_embeddings)
        return vision_features
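
For the text side, a minimal sketch that mirrors the workaround above. It assumes the model exposes `text_encoder` and `text_proj` submodules (as BlipForImageTextRetrieval does in the transformers versions I have looked at) and that taking the first ([CLS]) token of the text encoder output, projecting it and L2-normalizing it matches what the model does internally for its similarity score. The helper name `get_text_features` is my own, not part of the library, so please check this against your installed version.

```python
import torch
import torch.nn.functional as F


def get_text_features(model, input_ids, attention_mask):
    """Hypothetical helper: projected, L2-normalized text features.

    Assumes `model` is a BlipForImageTextRetrieval-like module exposing
    `text_encoder` and `text_proj`.
    """
    with torch.no_grad():
        # Text-only pass: no image states are given, so cross-attention is not used
        text_out = model.text_encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=True,
        )
        # Take the first ([CLS]) token as the sentence representation
        cls_embedding = text_out.last_hidden_state[:, 0, :]
        # Project into the shared image-text space and normalize to unit length
        text_features = F.normalize(model.text_proj(cls_embedding), dim=-1)
    return text_features
```

The inputs can come from the processor, e.g. `inputs = processor(text=["a photo of a cat"], return_tensors="pt", padding=True)`, followed by `get_text_features(model, inputs.input_ids, inputs.attention_mask)`. One thing I am not sure about: the model's own forward pass seems to project `last_hidden_state[:, 0, :]` of the vision model rather than `pooler_output`, so the image workaround may differ slightly from the features the model uses internally.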