How to generate a caption from logit scores

from PIL import Image
import requests
from transformers import AutoProcessor, BlipForConditionalGeneration, AutoTokenizer

processor_base = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model_base = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTrp7yKuY1NxcXlHQX10JtTlECuna-xWv-jetxnv73WBw&s"
image = Image.open(requests.get(url, stream=True).raw)

inputs_base = processor_base(images=image, return_tensors="pt")
pixel_values = inputs_base.pixel_values

outputs_base = model_base.generate(**inputs_base, output_scores=True, output_logits=True, return_dict_in_generate=True, max_length=50)
logits = outputs_base.logits

How do I generate the caption from these logits by performing the softmax?

Hey!

You can get the generated caption from outputs_base.sequences.

out_text = processor_base.batch_decode(outputs_base.sequences, skip_special_tokens=True)
print(out_text)

If you want to work with the scores, you can simply take a softmax over them. Note that the scores here are the scores after applying all the processing (temperature, normalization, etc.) that was passed into generate. In your code snippet there is no such processing, though 🙂

scores = torch.stack(outputs_base.scores, dim=1)  # tuple of [batch, vocab] tensors -> [batch, gen_len, vocab]
gen_token_ids = torch.argmax(scores, dim=-1)
out_text = processor_base.batch_decode(gen_token_ids)
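
If you want actual probabilities for each generated token, here is a minimal sketch along the same lines. It assumes greedy decoding, so that sequences[:, 1:] lines up one-to-one with the scores tuple (BLIP prepends a single decoder start token to sequences):

probs = torch.softmax(torch.stack(outputs_base.scores, dim=1), dim=-1)  # [batch, gen_len, vocab]
gen_ids = outputs_base.sequences[:, 1:]  # drop the decoder start token to align with scores
token_probs = torch.gather(probs, 2, gen_ids.unsqueeze(-1)).squeeze(-1)  # prob of each generated token
for tok, p in zip(gen_ids[0], token_probs[0]):
    print(processor_base.decode([int(tok)]), float(p))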

from transformers import BlipForConditionalGeneration, AutoProcessor, AutoTokenizer
from PIL import Image
import requests
import torch

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
tokenizer = AutoTokenizer.from_pretrained("Salesforce/blip-image-captioning-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "A picture of"

inputs = processor(images=image, text=text, return_tensors="pt")
decoder_input_ids = [model.decoder_input_ids]
print(decoder_input_ids)
inputs.input_ids = decoder_input_ids = torch.tensor([decoder_input_ids])
predicted_ids = []
for i in range(20):
    outputs = model(**inputs)
    logits = outputs.logits[:, i, :]
    print(outputs.logits.shape)
    predicted_id = logits.argmax(-1)
    predicted_ids.append(predicted_id.item())
    print(tokenizer.decode([predicted_id.squeeze()]))
    # add predicted id to decoder_input_ids
    inputs.input_ids = torch.cat([inputs.input_ids, predicted_id.unsqueeze(0)], dim=1)

With this code, for every image I get the same logits shape [1, 5, 30524], and the logits stay the same at every iteration: changing inputs.input_ids causes no change in the logits. Why do I face this error, and how can I solve it?
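
A likely explanation (an assumption based on how transformers' BatchEncoding behaves, not something stated above): inputs is a BatchEncoding, and assigning to inputs.input_ids only sets a Python attribute on the object; it does not update the underlying dict that **inputs unpacks. So the model keeps seeing the original 5 tokens of "A picture of" on every iteration, which would explain why the logits shape stays [1, 5, 30524] and the values never change. Passing the growing input_ids tensor explicitly, and always reading the logits at the last position rather than [:, i, :], gives a working greedy loop; a sketch under those assumptions:

inputs = processor(images=image, return_tensors="pt")
input_ids = torch.tensor([[model.decoder_input_ids]])  # start from BLIP's decoder start (BOS) token
for _ in range(20):
    outputs = model(pixel_values=inputs["pixel_values"], input_ids=input_ids)
    next_logits = outputs.logits[:, -1, :]  # logits for the position being predicted
    next_id = next_logits.argmax(-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_id], dim=1)  # feed the prediction back in
    if next_id.item() == model.config.text_config.sep_token_id:  # BLIP ends captions with [SEP]
        break
print(processor.decode(input_ids[0], skip_special_tokens=True))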