I am considering the BLIP-2 model for getting the probability distribution of each token in a caption given an image. So basically, if the tokens of a caption are w1, w2, w3, …, wt, then I want to get the estimates P(w1|image), P(w2|image, w1), P(w3|image, w1, w2), etc. This is the approach I took:
from PIL import Image
import requests
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch
device = "cuda"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)
model.to(device)
url = "http://images.cocodataset.org/val2017/000000485895.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device, torch.float16)
sentence = 'A giraffe stares into the camera while standing on green grass in front of a shade tree.'
input_ids = processor.tokenizer(sentence, return_tensors='pt').input_ids.to(device)
output = model(pixel_values, input_ids=input_ids).logits.detach()
# drop the first 32 positions, which correspond to the Q-Former query tokens prepended to the text
considered_logits = output[:, 32:, :]
So considered_logits should give the (pre-softmax) distribution for each token in the caption. Am I doing this right?
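For context, this is roughly how I plan to read the per-token probabilities out of considered_logits. It assumes the 32-query-token offset above is correct and that position 0 of the sliced logits corresponds to the BOS token the OPT tokenizer prepends, which is exactly the part I am unsure about:

# Sketch (assumes the 32-token offset is right): logits at text position i-1
# should be the distribution over token i, since the LM predicts the next token.
probs = torch.softmax(considered_logits.float(), dim=-1)  # (1, text_len, vocab_size)
tokens = input_ids[0]  # [BOS, w1, w2, ..., wt]
for i in range(1, tokens.size(0)):
    tok = tokens[i].item()
    # P(w_i | image, w_1, ..., w_{i-1})
    p = probs[0, i - 1, tok].item()
    print(processor.tokenizer.decode([tok]), p)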