LayoutLMv3 inference with a lot of bboxes


I followed @nielsr's tutorial for LayoutLMv3 training and inference: Transformers-Tutorials/Fine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb at master · NielsRogge/Transformers-Tutorials · GitHub

However, inference is not straightforward for my use case because I have a lot of bboxes, and a single call truncates everything beyond the model's maximum sequence length. I had to pass the bboxes in slices so that I get predictions for all of them:

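import torch
from PIL import ImageDraw, ImageFont, ImageOps

# dataset, processor, model and label2color are defined as in the tutorial notebook
example = dataset["train"][23]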
image = ImageOps.exif_transpose(example["image"])
words = example["tokens"]
boxes = example["bboxes"]
word_labels = example["ner_tags"]


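# n = number of words per slice; the value below is an assumed placeholder,
# pick one small enough that each slice stays within the model's token limit
n = 100
words = [words[i:i+n] for i in range(0, len(words), n)]
boxes = [boxes[i:i+n] for i in range(0, len(boxes), n)]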

draw = ImageDraw.Draw(image)

font = ImageFont.load_default()
width, height = image.size
def unnormalize_box(bbox, width, height):
    # convert a box from the 0-1000 normalized scale back to pixel coordinates
    return [
        width * (bbox[0] / 1000),
        height * (bbox[1] / 1000),
        width * (bbox[2] / 1000),
        height * (bbox[3] / 1000),
    ]

for word_group, boxes_group in zip(words, boxes):

    encoding = processor(image, word_group, boxes=boxes_group, truncation=True, padding="max_length", return_tensors="pt")

    with torch.no_grad():
        outputs = model(**encoding)
    logits = outputs.logits
    predictions = logits.argmax(-1).squeeze().tolist()

    token_boxes = encoding.bbox.squeeze().tolist()

    true_predictions = [model.config.id2label[pred] for pred in predictions]
    true_boxes = [unnormalize_box(box, width, height) for box in token_boxes]

    for prediction, box in zip(true_predictions, true_boxes):
        predicted_label = prediction
        draw.rectangle(box, outline=label2color[predicted_label])
        draw.text((box[0] + 10, box[1] - 10), text=predicted_label, fill=label2color[predicted_label], font=font)
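
For reference, the slicing step on its own could be written as a small helper that keeps words and boxes aligned. This is only a sketch: `chunk_pairs` and the sample words/boxes are made up for illustration and are not part of the tutorial or of transformers.

```python
from typing import Iterator, List, Sequence, Tuple


def chunk_pairs(
    words: Sequence[str],
    boxes: Sequence[List[int]],
    size: int,
) -> Iterator[Tuple[List[str], List[List[int]]]]:
    """Yield (words, boxes) slices of at most `size` items, kept in step."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    if len(words) != len(boxes):
        raise ValueError("words and boxes must have the same length")
    for start in range(0, len(words), size):
        stop = start + size
        yield list(words[start:stop]), list(boxes[start:stop])


# Hypothetical sample data, only to show the shape of the output
sample_words = ["Invoice", "No.", "12345", "Total", "99.00"]
sample_boxes = [
    [10, 10, 80, 30],
    [85, 10, 110, 30],
    [115, 10, 170, 30],
    [10, 40, 60, 60],
    [65, 40, 120, 60],
]

chunks = list(chunk_pairs(sample_words, sample_boxes, size=2))
for word_group, boxes_group in chunks:
    print(word_group, boxes_group)
```

Each `(word_group, boxes_group)` pair can then be passed to the processor exactly as in the loop above.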


Is there a better way of doing this?