BLIP-2 as a classification model

I was wondering whether it is even possible to use the BLIP-2 model (`Blip2ForConditionalGeneration`) for classification-like tasks. I have not been able to find any thorough information on how to use this model with a classification head.

Also, if the answer is yes, which features should be extracted to train the classifier on? I can think of two possibilities:

  1. Use the `last_hidden_state` of the Q-Former and combine these features with the `last_hidden_state` of the vision model; or
  2. Use the pooled output of the Q-Former.
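To make the two options concrete, here is a minimal sketch of a classification head in PyTorch. The tensor shapes follow BLIP-2's defaults (32 query tokens of size 768 from the Q-Former; 257 patch tokens of size 1408 from the ViT-g vision encoder), and random tensors stand in for real features. In practice the features would come from something like `Blip2Model.get_qformer_features(...)` and `Blip2Model.get_image_features(...)` in Transformers; the head itself, the class names, and `num_classes` are my own assumptions, not an official API.

```python
import torch
import torch.nn as nn

batch = 4
# Stand-ins for real BLIP-2 features (shapes assume the default config):
qformer_hidden = torch.randn(batch, 32, 768)    # Q-Former last_hidden_state
vision_hidden = torch.randn(batch, 257, 1408)   # vision model last_hidden_state
num_classes = 10                                # assumed number of target labels


class Blip2ClassifierHead(nn.Module):
    """Linear head over mean-pooled Q-Former (and optionally vision) features."""

    def __init__(self, q_dim=768, v_dim=1408, num_classes=10, use_vision=True):
        super().__init__()
        self.use_vision = use_vision
        in_dim = q_dim + v_dim if use_vision else q_dim
        self.head = nn.Linear(in_dim, num_classes)

    def forward(self, q_feats, v_feats=None):
        # Option 2: pool the Q-Former query tokens (mean pooling here).
        pooled = q_feats.mean(dim=1)                       # (batch, q_dim)
        # Option 1: additionally concatenate pooled vision features.
        if self.use_vision and v_feats is not None:
            pooled = torch.cat([pooled, v_feats.mean(dim=1)], dim=-1)
        return self.head(pooled)                           # (batch, num_classes)


clf = Blip2ClassifierHead(num_classes=num_classes)
logits = clf(qformer_hidden, vision_hidden)
print(logits.shape)  # torch.Size([4, 10])
```

With `use_vision=False` this reduces to option 2 alone; the frozen BLIP-2 backbone would only be run forward, and just the head is trained (e.g. with `nn.CrossEntropyLoss`).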

I feel like this is an interesting topic, but unfortunately I was not able to find much information about it.
Any related tips would be really appreciated. Thanks!