Getting token probabilities of a caption given an image from BLIP2

I am using the BLIP-2 model to get the probability distribution of each token in a caption given an image. That is, if the words in a caption are w1, w2, w3, …, wt, then I want the estimates P(w1|image), P(w2|image, w1), P(w3|image, w1, w2), and so on. This is the approach I took:

from PIL import Image
import requests
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch

device = "cuda" 

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)
model.to(device)

url = "http://images.cocodataset.org/val2017/000000485895.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device, torch.float16)

sentence = 'A giraffe stares into the camera while standing on green grass in front of a shade tree.'
input_ids = processor.tokenizer(sentence, return_tensors='pt').input_ids.to(device)

# Forward pass: returns logits over the vocabulary for every position in the sequence
output = model(pixel_values, input_ids=input_ids).logits.detach()

# Skip the first 32 positions, which I assume correspond to the Q-Former query tokens
considered_logits = output[:, 32:, :]

So considered_logits is the probability distribution over each token in the caption. Am I doing it right?

@joaogante what do you think?

Hey @snpushpi :wave:

Correct, that will get you the probability distribution for each token in the caption. Bear in mind that those are unnormalized logits, so you'll want a softmax to turn them into probabilities :slight_smile:

From that distribution you can fetch P(w1|image), P(w2|image, w1), P(w3|image, w1, w2), and so on, using the token IDs in input_ids.
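
Something along these lines should work. This is a minimal sketch reusing considered_logits, input_ids, and processor from your code above, and it assumes the slicing leaves the logits aligned one-to-one with input_ids; remember that in a causal LM the logits at position t predict the token at position t+1:

import torch.nn.functional as F

# Turn the unnormalized logits into log-probabilities over the vocabulary.
log_probs = F.log_softmax(considered_logits.float(), dim=-1)

# Align predictions with targets: the logits at position t predict the token
# at position t+1, so drop the last prediction and the BOS target.
predictions = log_probs[:, :-1, :]
targets = input_ids[:, 1:]

# Pick out the log-probability assigned to each actual caption token.
token_log_probs = predictions.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

for token, lp in zip(processor.tokenizer.convert_ids_to_tokens(targets[0].tolist()), token_log_probs[0]):
    print(f"{token}: p = {lp.exp().item():.4f}")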

I think the last line of your code is incorrect. When I run this code, the output has dimension [1, 20, 50304]. Why do you start at position 32?
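
The 32 presumably comes from the checkpoint's num_query_tokens (the Q-Former prepends 32 learned query embeddings to the text), but whether those query positions show up in .logits seems to depend on the transformers version. A quick sanity check, sketched here reusing output and input_ids from the code above, would tell which offset to use:

num_text_tokens = input_ids.shape[1]
# If the Q-Former's query positions are included in the logits, the sequence
# dimension is num_query_tokens (32) longer than the text; otherwise they match.
offset = output.shape[1] - num_text_tokens
considered_logits = output[:, offset:, :]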