I am considering the BLIP-2 model for getting the probability distribution of each token in a caption given an image. So basically, if the tokens of a caption are w1, w2, w3, …, wt, then I want to get the estimates P(w1|image), P(w2|image, w1), P(w3|image, w1, w2), etc. This is the approach I took:
from PIL import Image
import requests
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch
device = "cuda"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)
model.to(device)
url = "http://images.cocodataset.org/val2017/000000485895.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device, torch.float16)
sentence = 'A giraffe stares into the camera while standing on green grass in front of a shade tree.'
input_ids = processor.tokenizer(sentence, return_tensors='pt').input_ids.to(device)
output = model(pixel_values, input_ids=input_ids).logits.detach()
# drop the first 32 positions, which correspond to the Q-Former query tokens prepended to the text
considered_logits = output[:, 32:, :]
So considered_logits should give the (pre-softmax) distribution for each token in the caption. Am I doing this right?
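For context, this is roughly how I plan to read the per-token probabilities out of considered_logits. It assumes the 32-query-token offset above is correct and that position 0 of the sliced logits corresponds to the BOS token the OPT tokenizer prepends, which is exactly the part I am unsure about:

# Sketch (assumes the 32-token offset is right): logits at text position i-1
# should be the distribution over token i, since the LM predicts the next token.
probs = torch.softmax(considered_logits.float(), dim=-1)  # (1, text_len, vocab_size)
tokens = input_ids[0]  # [BOS, w1, w2, ..., wt]
for i in range(1, tokens.size(0)):
    tok = tokens[i].item()
    # P(w_i | image, w_1, ..., w_{i-1})
    p = probs[0, i - 1, tok].item()
    print(processor.tokenizer.decode([tok]), p)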