Hi,
I followed @nielsr's tutorial for LayoutLMv3 training and inference: Transformers-Tutorials/Fine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb at master · NielsRogge/Transformers-Tutorials · GitHub
However, inference is not straightforward for my use case because I have a lot of bounding boxes. I had to pass the boxes in slices of 128 words so that I get predictions for all of them:
```python
import torch
from PIL import ImageDraw, ImageFont, ImageOps

# dataset, processor, model and label2color are set up as in the notebook
example = dataset["train"][23]
image = ImageOps.exif_transpose(example["image"])
words = example["tokens"]
boxes = example["bboxes"]
word_labels = example["ner_tags"]

# split the words/boxes into slices of 128 so every box gets a prediction
n = 128
word_slices = [words[i:i + n] for i in range(0, len(words), n)]
box_slices = [boxes[i:i + n] for i in range(0, len(boxes), n)]

draw = ImageDraw.Draw(image)
font = ImageFont.load_default()
width, height = image.size

def unnormalize_box(bbox, width, height):
    # boxes are normalized to a 0-1000 scale; map them back to pixel coordinates
    return [
        width * (bbox[0] / 1000),
        height * (bbox[1] / 1000),
        width * (bbox[2] / 1000),
        height * (bbox[3] / 1000),
    ]

for word_group, box_group in zip(word_slices, box_slices):
    encoding = processor(image, word_group, boxes=box_group,
                         truncation=True, padding="max_length",
                         return_tensors="pt")
    with torch.no_grad():
        outputs = model(**encoding)

    predictions = outputs.logits.argmax(-1).squeeze().tolist()
    token_boxes = encoding.bbox.squeeze().tolist()
    word_ids = encoding.word_ids(0)

    true_predictions = [model.config.id2label[pred] for pred in predictions]
    true_boxes = [unnormalize_box(box, width, height) for box in token_boxes]

    for prediction, box, word_id in zip(true_predictions, true_boxes, word_ids):
        if word_id is None:
            continue  # skip special and padding tokens instead of drawing them at (0, 0)
        draw.rectangle(box, outline=label2color[prediction])
        draw.text((box[0] + 10, box[1] - 10), text=prediction,
                  fill=label2color[prediction], font=font)

image
```
Is there a better way of doing this?
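The closest alternative I found is letting the processor create the overlapping windows itself via `stride` and `return_overflowing_tokens=True`. Below is only a sketch: the `stride`/`max_length` values are guesses, duplicating `pixel_values` per window is my own workaround, and the overlapping predictions would still need to be merged per word (e.g. using `encoding.word_ids(i)` for each window), so I'm not sure this is the intended approach:

```python
# a minimal sketch, assuming the same processor, model, image, words and boxes as above
encoding = processor(
    image,
    words,
    boxes=boxes,
    truncation=True,
    max_length=512,
    stride=128,                      # overlap between consecutive windows
    padding="max_length",
    return_overflowing_tokens=True,  # one row of input_ids per window
    return_tensors="pt",
)

# the model doesn't accept this key; it maps each window to its source sample
# (always 0 here, since we pass a single image)
overflow_map = encoding.pop("overflow_to_sample_mapping")

# the tokenizer returns one row per window but only one image,
# so repeat the pixel values for every window
encoding["pixel_values"] = encoding["pixel_values"][overflow_map]

with torch.no_grad():
    outputs = model(**encoding)

predictions = outputs.logits.argmax(-1)  # shape: (num_windows, 512)
```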