Hello, I was wondering whether there is any way, or any example, showing how to extract text and image features from BLIP-2 in the same embedding space, ideally for use in image-text matching. Or is this model perhaps not meant for that task? I can extract the text and image features, but they are not in the same space and do not have the same shape.
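For reference, this is roughly what I am doing with the `transformers` `Blip2Model` (a minimal sketch; the checkpoint, image path, and caption are just placeholders), which is where I see the mismatched shapes:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2Model

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")
model.eval()

image = Image.open("example.jpg").convert("RGB")

with torch.no_grad():
    # Vision encoder output: roughly (batch, num_patches, 1408) for the ViT-g backbone
    image_inputs = processor(images=image, return_tensors="pt")
    image_feats = model.get_image_features(**image_inputs).last_hidden_state

    # Language model hidden states: roughly (batch, seq_len, 2560) for OPT-2.7b
    text_inputs = processor(text="two cats lying on a couch", return_tensors="pt")
    text_out = model.get_text_features(**text_inputs, output_hidden_states=True)
    text_feats = text_out.hidden_states[-1]

print(image_feats.shape, text_feats.shape)  # different widths, so not directly comparable
```

As far as I can tell, these come from the vision encoder and the language model respectively, so there is no shared projection space like the ITC/ITM heads in the original codebase.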
For example, the original BLIP-2 codebase has an example of how to use it for image-text matching, but it seems that this feature is not available in the HuggingFace version: LAVIS/examples/blip2_image_text_matching.ipynb at 3446bac20c5646d35ae383ebe6d13cec4f8b00cb · salesforce/LAVIS · GitHub
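If I remember that notebook correctly, the LAVIS side looks roughly like this (the image path and caption are placeholders):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Loads the Q-Former based image-text matching model and its preprocessors
model, vis_processors, text_processors = load_model_and_preprocess(
    "blip2_image_text_matching", "pretrain", device=device, is_eval=True
)

raw_image = Image.open("example.jpg").convert("RGB")
img = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
txt = text_processors["eval"]("two cats lying on a couch")

# ITM head: binary match / no-match classification
itm_logits = model({"image": img, "text_input": txt}, match_head="itm")
itm_score = torch.nn.functional.softmax(itm_logits, dim=1)[:, 1].item()

# ITC head: similarity between projected image and text embeddings
itc_score = model({"image": img, "text_input": txt}, match_head="itc").item()

print(itm_score, itc_score)
```

This is exactly the kind of shared-space scoring I am looking for in the HuggingFace version. Is there an equivalent, or a recommended way to get the Q-Former projected embeddings for both modalities there?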