Error when executing pix2struct-widget-captioning-base model

Hi!
I’m trying to run the pix2struct-widget-captioning-base model. (link)

When I execute it as described on the model card, I get an error: “ValueError: A header text must be provided for VQA models.”

I think the model card is missing the information on how to provide the bounding box that locates the widget. The description just says usage is the same as for pix2struct-textcaps-base, which I don’t think is the case.

Any idea how to add this information?
I’m quite new to using HuggingFace models.

Thank you!


I found that the code works if a text question is used.
But I’m not sure what format to use for the bounding box. Any example for that?

My code:

from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

file_path = r"…\test_images\full_screenshot.png"
image = Image.open(file_path).convert("RGB")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-widget-captioning-base")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-widget-captioning-base")

# TODO: add a bounding box instead of the question
question = "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"

inputs = processor(images=image, text=question, return_tensors="pt")
predictions = model.generate(**inputs)
print(processor.decode(predictions[0], skip_special_tokens=True))

Did you figure out how to use the model?

Currently I manually draw a bounding box onto the image using an image editing tool. But I get the same prediction no matter where I draw the bounding box, so I’m not sure the model is working correctly.

Below is my code:

import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-widget-captioning-base")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-widget-captioning-base")
image = Image.open("/path/to/image.png")  # screenshot with the bounding box already drawn in manually
inputs = processor(images=image, text="", return_tensors="pt")  # empty text avoids the header-text error

predictions = model.generate(**inputs)
print(processor.decode(predictions[0], skip_special_tokens=True))
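
To test this more systematically, here is a rough sketch of what I have in mind: it draws the bounding box with PIL’s ImageDraw instead of an external image editor, so several box positions can be compared quickly. This is just my assumption about the input format (the box rendered directly into the screenshot image); the coordinates below are placeholders, not from any dataset.

from PIL import Image, ImageDraw
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-widget-captioning-base")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-widget-captioning-base")
base_image = Image.open("/path/to/image.png").convert("RGB")

# Hypothetical pixel coordinates (left, top, right, bottom) for two different widgets
candidate_boxes = [(50, 100, 300, 160), (50, 800, 300, 860)]

for box in candidate_boxes:
    # Draw each box on a fresh copy so annotations don't accumulate across runs
    annotated = base_image.copy()
    ImageDraw.Draw(annotated).rectangle(box, outline="red", width=5)

    # An empty string satisfies the processor's header-text check for this checkpoint
    inputs = processor(images=annotated, text="", return_tensors="pt")
    predictions = model.generate(**inputs)
    print(box, processor.decode(predictions[0], skip_special_tokens=True))

If the caption still doesn’t change between the two boxes, the problem is probably with how the box is expected to be encoded rather than with the drawing step.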

No, I couldn’t figure it out. I looked a bit at the original Google pix2struct GitHub repo, but didn’t find an example there either. Otherwise I’m not sure how to investigate further. I’d be happy to see a solution.

The model you are using is a captioning model, so it will generate a caption for the whole image. Drawing a bounding box should not normally change the caption of the image, which is why you get the same prediction.

Hi doc2txt,
as the name of the model implies, it is a widget captioning model: we need to provide a bounding box, and the part of the image within the bounding box gets captioned.
See the paper about pix2struct https://arxiv.org/pdf/2210.03347.pdf, page 6:
“Widget Captioning (Li et al., 2020b) is an image captioning task where the input is an app screenshot annotated with a single bounding box denoting a widget (e.g. a button or a scroll bar). The caption describes the functionality of the widget (e.g. find location). VUT (Li et al., 2021b), the current SotA, uses a specialized UI encoder combining images, bounding boxes, and view hierarchies. Pix2Struct-Large improves the SotA CIDEr from 127.4 to 136.7.”

Regards,
Tamas