# Getting token probabilities of a caption given an image from BLIP2

I was considering the BLIP2 model for getting the probability distribution of each token in the caption given an image. So basically if the words in a caption are w1,w2,w3,…wt then I want to get these estimates P(w1|image), P(w2|image,w1),P(w3|image,w1,w2) etc. So this is the approach I took -

```python
from PIL import Image
import requests
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch

device = "cuda"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)
model.to(device)

url = "http://images.cocodataset.org/val2017/000000485895.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device, torch.float16)

sentence = 'A giraffe stares into the camera while standing on green grass in front of a shade tree.'
input_ids = processor.tokenizer(sentence, return_tensors = 'pt').input_ids.to(device)

output = model(pixel_values, input_ids=input_ids).logits.detach()

# BLIP-2 prepends 32 Q-Former query tokens, so the caption's logits start at index 32
considered_logits = output[:, 32:, :]
``````

So `considered_logits` should hold the model's predicted distribution for each token in the caption. Am I doing this right?

@joaogante what do you think?

@joaogante any thoughts?

Hey @snpushpi Correct, that will get you the probability distribution for each token in the caption. Bear in mind that those are unnormalized logits. From that distribution, you can fetch "P(w1|image), P(w2|image,w1), P(w3|image,w1,w2)" using the token IDs in `input_ids`.
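
To turn those logits into the actual per-token probabilities, you still need a softmax over the vocabulary and a one-position shift, since the logits at position i score the token at position i + 1. Here is a minimal sketch with small dummy tensors standing in for `considered_logits` and `input_ids` (the shapes are illustrative, not BLIP-2's real vocabulary size):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Dummy stand-ins: batch of 1, caption of 5 tokens, vocabulary of 10.
logits = torch.randn(1, 5, 10)            # plays the role of considered_logits
token_ids = torch.randint(0, 10, (1, 5))  # plays the role of input_ids

# Logits at position i predict the token at position i + 1, so shift by one.
log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)  # normalize over the vocab
targets = token_ids[:, 1:]                            # the tokens being predicted

# Gather log P(w_{i+1} | image, w_1..w_i) for each caption position.
token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
print(token_log_probs.shape)  # torch.Size([1, 4])
```

Exponentiate `token_log_probs` if you want raw probabilities rather than log-probabilities; summing the log-probabilities gives the log-likelihood of the whole caption given the image.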