Hi there,
I’ve been struggling to get even very basic visual question answering to work, i.e. asking a model questions about an image. My main goal is to feed a model an architectural drawing and have it make assessments. Here’s a reproducible example of what I’m experiencing:
import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor and model
model_name = "Salesforce/blip-vqa-base"
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)

# Run on MPS if available, otherwise fall back to CPU
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
model.to(device)
# Preprocess the image
url = 'https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png'
image = Image.open(requests.get(url, stream=True).raw).convert('RGB')
question="What is this?"
inputs = processor(image, question, return_tensors="pt")
# Ensure that inputs are on the same device as the model
inputs = {k: v.to(device) for k, v in inputs.items()}
# Generate an answer
with torch.no_grad():
    output = model.generate(**inputs)  # also tried with max_length=200
# Decode the answer
answer = processor.decode(output[0], skip_special_tokens=False)
I get back `answer = "What is this?"`, i.e. the model has not added anything beyond the question.
I’ve tried different images and different models. I’m on a Mac (M2, arm64) with Python 3.9 and transformers 4.37.2. My first thought was that it was related to my Mac architecture.
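To rule that out, here is a minimal sketch of the CPU-only check I’d run next, reusing the processor, model, image and question from the snippet above (I haven’t confirmed yet whether this changes anything):

# CPU-only check: if the answer still just echoes the question here,
# MPS is probably not the culprit.
device = torch.device("cpu")
model.to(device)

inputs = processor(image, question, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    output = model.generate(**inputs)

print(processor.decode(output[0], skip_special_tokens=False))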