Error when executing pix2struct-widget-captioning-base model

molntamas · March 29, 2023, 12:22pm

Hi!
I’m trying to run the pix2struct-widget-captioning-base model. (link)

When I run it as described on the model card, I get an error: “ValueError: A header text must be provided for VQA models.”

I think the model card is missing information on how to add the bounding box that locates the widget. It only says usage should be the same as for pix2struct-textcaps-base, which I don’t think is the case.

Any idea how to add this information?
I’m quite new to using HuggingFace models.

Thank you!

molntamas · March 29, 2023, 2:43pm

I found that the code works if a text question is used.
But I’m not sure what format to use for the bounding box. Any example for that?

My code:

from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Raw string so the backslashes in the Windows path are not read as escapes
file_path = r"…\test_images\full_screenshot.png"
image = Image.open(file_path).convert("RGB")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-widget-captioning-base")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-widget-captioning-base")

# TODO: add a bounding box instead of the question

question = "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"

inputs = processor(images=image, text=question, return_tensors="pt")
predictions = model.generate(**inputs)
print(processor.decode(predictions[0], skip_special_tokens=True))

donfour · May 25, 2023, 3:40pm

Did you figure out how to use the model?

Currently I manually draw bounding boxes inside the image using an image editing tool. But I’m getting the same prediction no matter where I draw the bounding box, so I’m not sure if the model is working.

Below is my code:

import requests
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-widget-captioning-base")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-widget-captioning-base")
image = Image.open("/path/to/image.png")
inputs = processor(images=image, text="", return_tensors="pt")

predictions = model.generate(**inputs)
print(processor.decode(predictions[0], skip_special_tokens=True))

molntamas · June 7, 2023, 8:04am

No, I couldn’t figure it out. I was checking the original google pix2struct github repo a bit, but didn’t find an example. Otherwise I’m not sure how to investigate further. Would be happy to see the solution.

doc2txt · February 11, 2024, 12:15pm

The model you are using is a captioning model, so it generates a caption for the whole image. Drawing a bounding box should not normally change that caption, which is why you get the same prediction.

molntamas · February 21, 2024, 7:59am

Hi doc2txt,
As the name of the model implies, it is a widget captioning model: we need to provide a bounding box, and the part of the image within the bounding box gets captioned.
See the paper about pix2struct https://arxiv.org/pdf/2210.03347.pdf, page 6:
“Widget Captioning (Li et al., 2020b) is an image captioning task where the input is an app screenshot annotated with a single bounding box denoting a widget (e.g. a button or a scroll bar). The caption describes the functionality of the widget (e.g. find location). VUT (Li et al., 2021b), the current SotA uses a specialized UI encoder combining images, bounding boxes, and view hierarchies. Pix2StructLarge improves the SotA CIDEr from 127.4 to 136.7.”

Regards,
Tamas
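
For anyone who wants to try the "screenshot annotated with a single bounding box" input from the paper quote without an image editing tool, here is a minimal sketch that draws the box with PIL. The helper name draw_bounding_box, the red color, and the line width are my own assumptions; nothing in this thread confirms the box style the model was trained with, so treat it as a starting point for experiments and not as the official preprocessing.

```python
from PIL import Image, ImageDraw


def draw_bounding_box(image, box, color=(255, 0, 0), width=3):
    """Return an RGB copy of `image` with a rectangle outline drawn on it.

    `box` is (left, top, right, bottom) in pixel coordinates.
    Color and width are assumptions, not values taken from the model card.
    """
    annotated = image.convert("RGB")  # convert() returns a new image
    draw = ImageDraw.Draw(annotated)
    draw.rectangle(box, outline=color, width=width)
    return annotated


# Stand-in for a real screenshot: a blank white 200x100 image.
screenshot = Image.new("RGB", (200, 100), "white")

# Hypothetical widget location, chosen only for illustration.
annotated = draw_bounding_box(screenshot, (50, 20, 150, 80))
```

The annotated image can then be passed to the processor in place of the raw screenshot, as in the snippets earlier in this thread. Whether that actually changes the generated caption is exactly the open question here.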

SuperSteveHu · July 10, 2024, 8:07am

Hi molntamas,
I’m currently trying to run pix2struct-widget-captioning, and I ran into the same problem:

need to describe bbox instead of texts as a header
“ValueError: A header text must be provided for VQA models.”

I’m wondering whether you have found a way to solve it.

Thank you
Regards,
Steve
