Hi!
I’m trying to run the pix2struct-widget-captioning-base model. (link)
When I run it as described on the model card, I get an error: “ValueError: A header text must be provided for VQA models.”
I think the model card is missing information on how to add the bounding box that locates the widget. It only says the usage should be the same as for pix2struct-textcaps-base, which I don’t think is the case.
Any idea how to add this information?
I’m quite new to using HuggingFace models.
Currently I manually draw bounding boxes inside the image using an image editing tool. But I’m getting the same prediction no matter where I draw the bounding box, so I’m not sure if the model is working.
Below is my code:
import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Load the widget captioning checkpoint and its matching processor
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-widget-captioning-base")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-widget-captioning-base")

# Screenshot with the bounding box already drawn in by hand
image = Image.open("/path/to/image.png")

# An empty header text is passed here; without `text`, I get the ValueError quoted above
inputs = processor(images=image, text="", return_tensors="pt")
predictions = model.generate(**inputs)
print(processor.decode(predictions[0], skip_special_tokens=True))
No, I couldn’t figure it out. I looked through the original Google pix2struct GitHub repo a bit, but didn’t find an example. Otherwise I’m not sure how to investigate further. I would be happy to see the solution.
The model you are using is a captioning model, so it generates a caption for the whole image. Drawing a bounding box should not normally change the caption of the image, which is why you get the same prediction.
Hi doc2txt,
As the name of the model implies, it is a widget captioning model: we need to provide a bounding box, and the part of the image within the bounding box gets captioned.
See the paper about pix2struct https://arxiv.org/pdf/2210.03347.pdf, page 6:
“Widget Captioning (Li et al., 2020b) is an image captioning task where the input is an app screenshot annotated with a single bounding box denoting a widget (e.g. a button or a scroll bar). The caption describes the functionality of the widget (e.g. find location). VUT (Li et al., 2021b), the current SotA uses a specialized UI encoder combining images, bounding boxes, and view hierarchies. Pix2Struct-Large improves the SotA CIDEr from 127.4 to 136.7.”
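Going by the quoted description ("an app screenshot annotated with a single bounding box"), one way to try this without an image editing tool is to draw the box programmatically with Pillow. The following is only a minimal sketch: the helper name `draw_widget_box`, the red outline, the line width, and the pixel coordinates are all assumptions for illustration, not values confirmed for this checkpoint.

```python
from PIL import Image, ImageDraw


def draw_widget_box(image, box, color="red", width=3):
    """Return a copy of `image` with a rectangle outline drawn around `box`.

    `box` is (left, top, right, bottom) in pixel coordinates.
    The color and line width are assumptions, not values confirmed
    for the widget captioning checkpoint.
    """
    # convert() returns a new image, so the input is left untouched
    annotated = image.convert("RGB")
    draw = ImageDraw.Draw(annotated)
    draw.rectangle(box, outline=color, width=width)
    return annotated


# Stand-in for a real screenshot so the sketch runs without a file on disk
screenshot = Image.new("RGB", (200, 100), "white")
annotated = draw_widget_box(screenshot, (50, 20, 150, 80))
```

The annotated image could then be passed as `images=annotated` in the processor call from the code earlier in the thread. Comparing the captions for two clearly different boxes on the same screenshot would be a quick check of whether the model reacts to the box at all.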