Hi, I am trying to run inference on an image using the Pix2Struct vision transformer model. Currently I generate one inference at a time; the code I am using is below.
import torch
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

# Use the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the DePlot processor and model
processor = Pix2StructProcessor.from_pretrained(
    "google/deplot", is_vqa=True
)
model = Pix2StructForConditionalGeneration.from_pretrained(
    "google/deplot", is_vqa=True
).to(device)

# Load a single test image
with open("./data/test_imgs/test.png", "rb") as f:
    image = Image.open(f).convert("RGB")

# Preprocess the image together with the text prompt
inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
).to(device)

# Generate and decode one prediction
predictions = model.generate(**inputs, max_new_tokens=512)
deplot_result = processor.decode(
    predictions[0], skip_special_tokens=True
)
print(deplot_result)
However, the inference time for this method is roughly 45 seconds per image, which is not viable for our project. Is there a way to adapt this code to process images in batches, so that I can generate multiple predictions at the same time?
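For reference, this is the direction I have in mind: a minimal sketch that assumes Pix2StructProcessor accepts a list of images with a matching list of prompts and that processor.batch_decode can decode the whole batch at once. The image_paths list and batch_size value are placeholders I made up for illustration.

import torch
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Pix2StructProcessor.from_pretrained("google/deplot", is_vqa=True)
model = Pix2StructForConditionalGeneration.from_pretrained(
    "google/deplot", is_vqa=True
).to(device)

# Hypothetical list of image paths; replace with the real test set.
image_paths = ["./data/test_imgs/test1.png", "./data/test_imgs/test2.png"]
prompt = "Generate underlying data table of the figure below:"
batch_size = 4  # assumed value, tune to fit GPU memory

results = []
for start in range(0, len(image_paths), batch_size):
    batch_paths = image_paths[start:start + batch_size]
    images = [Image.open(p).convert("RGB") for p in batch_paths]

    # Assumes the processor can pad a list of images/prompts into one batch.
    inputs = processor(
        images=images,
        text=[prompt] * len(images),
        return_tensors="pt",
    ).to(device)

    with torch.no_grad():
        predictions = model.generate(**inputs, max_new_tokens=512)

    # Decode every sequence in the batch in one call.
    results.extend(processor.batch_decode(predictions, skip_special_tokens=True))

print(results)

Would something along these lines work with this model, and is this the recommended way to batch, or is there a better approach (e.g. a different batch size or padding setting) to bring the per-image latency down?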