Retrieve BlipForImageTextRetrieval image features


I am trying to get the image features/embeddings from BlipForImageTextRetrieval.

According to the docs:

  • image_embeds (torch.FloatTensor of shape (batch_size, output_dim) optional returned when model is initialized with with_projection=True) — The image embeddings obtained by applying the projection layer to the pooler_output

How can I set with_projection to True? I cannot find any information about this anywhere.
Is there any other way to extract the features?

How would I go about extracting the text features (for a future use case)?

Thank you!

Edit: This is my current workaround:

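from PIL import Image


def get_vision_features(img_path, model, processor):
    vision_model = model.vision_model
    # Assumption: the image is loaded with PIL (the original loading call was cut off)
    img = Image.open(img_path).convert("RGB")
    processed_images = processor(images=img, return_tensors="pt")

    # Pooled output of the vision encoder
    vision_embeddings = vision_model(pixel_values=processed_images.pixel_values, return_dict=True).pooler_output

    # Project into the shared image-text embedding space
    vision_features = model.vision_proj(vision_embeddings)
    return vision_features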
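
For the text features, the same pattern as the workaround above can presumably be applied. This is only a minimal sketch, assuming that the model exposes text_encoder and text_proj alongside vision_model and vision_proj, and that the processor accepts text-only input. get_text_features is a hypothetical helper name, and the checkpoint in the commented usage is just an example.

```python
def get_text_features(texts, model, processor):
    """Sketch: project the first-token text embedding into the shared space.

    Assumes `model` has `text_encoder` and `text_proj` attributes and that
    `processor` tokenizes text when called with `text=...`.
    """
    # Tokenize the text; padding lets a list of strings be batched together
    inputs = processor(text=texts, return_tensors="pt", padding=True)

    # Run the text encoder on its own (no image input)
    outputs = model.text_encoder(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        return_dict=True,
    )

    # Take the first token of each sequence as the sentence representation
    text_embeddings = outputs.last_hidden_state[:, 0, :]

    # Project into the shared image-text embedding space
    return model.text_proj(text_embeddings)


# Example usage (downloads weights; the checkpoint name is an assumption):
# import torch
# from transformers import BlipProcessor, BlipForImageTextRetrieval
# processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
# model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco").eval()
# with torch.no_grad():
#     text_features = get_text_features(["a photo of a cat"], model, processor)
```

As far as I can tell, the model's own forward pass L2-normalizes the projected image and text features before computing their similarity, so applying torch.nn.functional.normalize(features, dim=-1) to both is probably needed if cosine similarity is the goal.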