Were you able to solve the task? I noticed that you are using a slightly different approach from the one in [1].
In the previous post, the output field `qformer_outputs.last_hidden_state` is used to synthesize the information from the Q-Former using the `Blip2ForConditionalGeneration` class. Your approach seems to be using `Blip2Model`.
As far as I understand, the Q-Former already makes use of the vision model to generate its output. Could anyone with more experience explain which of these two methods is more effective?