Error when executing pix2struct-widget-captioning-base model

I’m trying to run the pix2struct-widget-captioning-base model. (link)

When I run it as described on the model card, I get an error: “ValueError: A header text must be provided for VQA models.”

I think the model card is missing instructions on how to supply the bounding box that locates the widget. It only says that usage should be the same as for pix2struct-textcaps-base, which I don’t think is the case.

Any idea how to add this information?
I’m quite new to using HuggingFace models.

Thank you!

I found that the code works if a text question is used.
But I’m not sure what format to use for the bounding box. Any example for that?

My code:

from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Raw string so that "\t" and "\f" in the Windows path are not read as escape sequences
file_path = r"…\test_images\full_screenshot.png"
image = Image.open(file_path).convert("RGB")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-widget-captioning-base")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-widget-captioning-base")

# TODO: add a bounding box instead of the question
question = "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"

# Passing any text avoids the "header text must be provided" ValueError
inputs = processor(images=image, text=question, return_tensors="pt")
predictions = model.generate(**inputs)
print(processor.decode(predictions[0], skip_special_tokens=True))

Did you figure out how to use the model?

Currently I manually draw bounding boxes inside the image using an image editing tool. But I’m getting the same prediction no matter where I draw the bounding box, so I’m not sure if the model is working.

Below is my code:

from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-widget-captioning-base")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-widget-captioning-base")

# The image already has the bounding box drawn on it by hand
image = Image.open("/path/to/image.png")
# An empty string is passed as the text, only to satisfy the header-text requirement
inputs = processor(images=image, text="", return_tensors="pt")

predictions = model.generate(**inputs)
print(processor.decode(predictions[0], skip_special_tokens=True))
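
For anyone who wants to try the drawn-box approach without an image editor, here is a minimal sketch that draws the rectangle with Pillow. It assumes the model reads the target widget from a box rendered onto the screenshot, which nothing in this thread confirms. The helper name draw_widget_box, the red color, and the line width are all hypothetical choices, not values taken from the model card.

```python
from PIL import Image, ImageDraw


def draw_widget_box(image, box, color=(255, 0, 0), width=3):
    """Return an RGB copy of `image` with a rectangle outline at `box`.

    `box` is (left, top, right, bottom) in pixel coordinates.
    The color and width defaults are assumptions, not documented values.
    """
    annotated = image.copy().convert("RGB")  # leave the caller's image untouched
    draw = ImageDraw.Draw(annotated)
    draw.rectangle(box, outline=color, width=width)
    return annotated


# Example on a blank white canvas standing in for a screenshot
blank = Image.new("RGB", (200, 100), "white")
boxed = draw_widget_box(blank, (20, 10, 120, 60))
```

The returned image could then be passed as images= in the processor call above, still together with a text argument so that the ValueError is not raised. Comparing predictions for two clearly different boxes on the same screenshot would be one way to check whether the box has any effect.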

No, I couldn’t figure it out. I looked through the original Google pix2struct GitHub repo a bit but didn’t find an example, and I’m not sure how to investigate further. I’d be happy to see a solution.