BLIP model gives no response

Hi there,
I’ve been struggling to get even very basic answers to questions about images. My main goal is to feed a model an architectural drawing and have it make assessments. Here’s a reproducible example of what I’m experiencing:

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_name = "Salesforce/blip-vqa-base"
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

# Load the processor and model, and move the model to the device
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)
model.to(device)

# Preprocess the image and question
url = "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "What is this?"
inputs = processor(image, question, return_tensors="pt")

# Ensure that inputs are on the same device as the model
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate an answer
with torch.no_grad():
    output = model.generate(**inputs)  # max_length=200

# Decode the answer
answer = processor.decode(output[0], skip_special_tokens=False)

I get back `answer = "What is this?"`, i.e. the model has not added anything.
I’ve tried different images and different models. I’m on a Mac (M2, arm64) with Python 3.9 and transformers 4.37.2, so my first thought was that the problem was related to the Mac architecture.

Hi @sketchcad,
Please try this snippet from the BLIP docs:

from PIL import Image
import requests
from transformers import AutoProcessor, BlipForQuestionAnswering

# Note: BlipForQuestionAnswering, not BlipForConditionalGeneration
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")
url = "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png"
image = Image.open(requests.get(url, stream=True).raw)

text = "What is this?"
inputs = processor(images=image, text=text, return_tensors="pt")
outputs = model.generate(**inputs)
print(processor.decode(outputs[0], skip_special_tokens=False))
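
The key difference from your snippet is the model class: blip-vqa-base is a VQA checkpoint, so it should be loaded with BlipForQuestionAnswering. Loaded into BlipForConditionalGeneration (the captioning class), generate treats the question as a text prefix to continue, which would explain the echo you saw. If you want to keep it on the M2 GPU, here is a minimal sketch of the same snippet with the MPS device handling folded in (same checkpoint and image URL as above; the only other change is skip_special_tokens=True so the special-token markers are dropped from the output):

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, BlipForQuestionAnswering

# Fall back to CPU if the MPS backend is not available
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to(device)
processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base")

url = "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# BatchEncoding.to() moves every tensor in the encoding to the device
inputs = processor(images=image, text="What is this?", return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(**inputs)
print(processor.decode(outputs[0], skip_special_tokens=True))

The same pattern should carry over to your architectural drawings: open a local file with Image.open(path) instead of fetching the URL, and change the question.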