Model inference using batch (Encoder-decoder)

Hi, I am trying to run inference on an image using the Pix2Struct vision transformer model (the DePlot checkpoint). Currently, I am generating one inference at a time, and the code I am using is below.

from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

processor = Pix2StructProcessor.from_pretrained(
    "google/deplot", is_vqa=True
model = Pix2StructForConditionalGeneration.from_pretrained(
    "google/deplot", is_vqa=True

# Load one image and prepare model inputs
with open("./data/test_imgs/test.png", "rb") as f:
    image ="RGB")

inputs = processor(
    text="Generate underlying data table of the figure below:",

# Generate and decode a single prediction
predictions = model.generate(**inputs, max_new_tokens=512)
deplot_result = processor.decode(
    predictions[0], skip_special_tokens=True

However, the inference time for this method is ~45 secs/image, which is not viable for our project. Is there a way to convert this code into using batches so that I can generate multiple predictions at the same time?