Adapting BLIP2 for zero-shot classification

Were you able to solve the task? I noticed that you are using a slightly different approach from the one in [1].
In the previous post, the output field qformer_outputs.last_hidden_state is used to synthesize the information from the Q-Former via the Blip2ForConditionalGeneration class, whereas your approach seems to use Blip2Model.

As far as I understand, the Q-Former already makes use of the vision model to generate its output. Could anyone with more experience explain which of these two methods is more effective?
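
For concreteness, here is a minimal sketch of the two routes as I understand them. The checkpoint name and image path are placeholders, and it assumes a recent transformers version where Blip2Model.get_qformer_features is available; both routes should end up reading the output of the same Q-Former module.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2Model, Blip2ForConditionalGeneration

# Placeholder checkpoint and image; swap in whatever you actually use.
checkpoint = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(checkpoint)
image = Image.open("example.jpg")  # any RGB image

# Route A: Blip2Model exposes the Q-Former output directly.
model = Blip2Model.from_pretrained(checkpoint)
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    qformer_out = model.get_qformer_features(pixel_values=inputs.pixel_values)
# Shape: (batch, num_query_tokens, qformer_hidden_size)
query_embeds_a = qformer_out.last_hidden_state

# Route B: Blip2ForConditionalGeneration returns the Q-Former output as part
# of its forward pass (which also needs text input for the language model).
gen_model = Blip2ForConditionalGeneration.from_pretrained(checkpoint)
inputs = processor(images=image, text="a photo of", return_tensors="pt")
with torch.no_grad():
    outputs = gen_model(**inputs)
query_embeds_b = outputs.qformer_outputs.last_hidden_state
```

If that sketch is right, the difference would mostly be convenience (and whether you also want the language-model head loaded), not a different set of features, but I'd appreciate confirmation from someone who has compared the two.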