LayoutLMv3 Q/A Inference

Bapt120 · January 17, 2023, 10:24am

Hi, I'm a beginner on this platform. For my master's degree project I have to use the LayoutLM model (more precisely, for question answering on documents).

I have a few questions about inference with the model for Q/A.

When I read the documentation, I found this for inference with the LayoutLMv1 Q/A model:

from transformers import AutoTokenizer, LayoutLMForQuestionAnswering
from datasets import load_dataset
import torch

tokenizer = AutoTokenizer.from_pretrained("impira/layoutlm-document-qa", add_prefix_space=True)
model = LayoutLMForQuestionAnswering.from_pretrained("impira/layoutlm-document-qa", revision="1e3ebac")

dataset = load_dataset("nielsr/funsd", split="train")
example = dataset[0]
question = "what's his name?"
words = example["words"]
boxes = example["bboxes"]

encoding = tokenizer(
    question.split(), words, is_split_into_words=True, return_token_type_ids=True, return_tensors="pt"
)
bbox = []
for i, s, w in zip(encoding.input_ids[0], encoding.sequence_ids(0), encoding.word_ids(0)):
    if s == 1:
        bbox.append(boxes[w])
    elif i == tokenizer.sep_token_id:
        bbox.append([1000] * 4)
    else:
        bbox.append([0] * 4)
encoding["bbox"] = torch.tensor([bbox])

word_ids = encoding.word_ids(0)
outputs = model(**encoding)
loss = outputs.loss
start_scores = outputs.start_logits
end_scores = outputs.end_logits
start, end = word_ids[start_scores.argmax(-1)], word_ids[end_scores.argmax(-1)]
print(" ".join(words[start : end + 1]))

So I understand how the inference works here: the logits are used to determine the beginning and the end of the answer.

But now I want to use the v3 model, and I found this in the docs:

from transformers import AutoProcessor, AutoModelForQuestionAnswering
from datasets import load_dataset
import torch

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = AutoModelForQuestionAnswering.from_pretrained("microsoft/layoutlmv3-base")

dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
example = dataset[0]
image = example["image"]
question = "what's his name?"
words = example["tokens"]
boxes = example["bboxes"]

encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
start_positions = torch.tensor([1])
end_positions = torch.tensor([3])

outputs = model(**encoding, start_positions=start_positions, end_positions=end_positions)
loss = outputs.loss
start_scores = outputs.start_logits
end_scores = outputs.end_logits

Firstly, can someone explain to me what the start and end positions in the model arguments are?

Secondly, when I try to run the same inference as above, it doesn't give me the expected results.

If I understood correctly, the logits are the probabilities of the beginning and the end of the answer. We want the argmax of each, and we pass it to the “word_ids” list to get the index of the word. Then we can look that index up in the input words list to get the word?

If anyone can help me, I would be very grateful. I've been thinking about this for days.

nielsr · January 23, 2023, 1:14pm

Hi,

The model outputs start_scores and end_scores, which are logits (unnormalized scores) that indicate which token is at the start, and which token is at the end of the answer. One can normalize them using softmax to turn them into probabilities.
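As a minimal sketch of that normalization, the snippet below applies softmax to a small made-up list of start logits in plain Python. The values and the helper name `softmax` are illustrative assumptions, not output from the model. It also shows, on the same toy numbers, how a label like the one passed as `start_positions` would enter a cross-entropy loss; as far as I know, that is the only role of `start_positions` and `end_positions`, so they are not needed at inference time.

```python
import math

def softmax(logits):
    """Turn a list of unnormalized scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy start logits for a 5-token sequence (made-up values).
start_logits = [0.1, 2.0, -1.0, 0.5, 0.0]
start_probs = softmax(start_logits)

# The most likely start token is the one with the highest probability,
# which is also the one with the highest logit.
best_start = max(range(len(start_probs)), key=start_probs.__getitem__)

# If token 1 were the labelled start (what start_positions=[1] would say),
# the cross-entropy loss for the start scores would be -log p(token 1).
gold_start = 1
start_loss = -math.log(start_probs[gold_start])

print(best_start, round(sum(start_probs), 6), start_loss)
```

With real model outputs, the same normalization would typically be done with `torch.softmax(outputs.start_logits, dim=-1)`.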

However, to turn these scores into actual predictions, we just need to take the argmax, i.e. the position of the highest score (for both start_scores and end_scores), which gives us the index of the respective start and end token.

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()
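Because the two argmax calls are independent, the predicted end index can come before the start index, which yields an empty answer. A common workaround, sketched here on toy logits in plain Python, is to search for the best-scoring pair with start <= end. The function name `best_span` and the `max_answer_len` parameter are hypothetical, and this is not part of the answer above.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end) maximizing start_logits[start] + end_logits[end]
    subject to start <= end < start + max_answer_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy logits (made-up values) where the independent argmax is inconsistent.
start_logits = [0.2, 0.1, 3.0, 0.3, 0.0]
end_logits = [0.0, 2.5, 0.1, 2.0, 0.4]

# Independent argmax: the end index lands before the start index.
naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))

# Constrained search: always returns a span with start <= end.
span = best_span(start_logits, end_logits)
print(naive, span)
```

With tensors from the model, the same function can be called on `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`.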

Next, we can use the processor (actually the tokenizer, which is used behind the scenes) to decode the predicted answer:

predict_answer_tokens = encoding.input_ids[0, answer_start_index : answer_end_index + 1]
print(processor.decode(predict_answer_tokens, skip_special_tokens=True))
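If you prefer whole words from the original `words` list, as in the LayoutLMv1 snippet, the token span can be mapped back through the word ids. The sketch below uses hand-written stand-ins for `encoding.word_ids(0)` and `encoding.sequence_ids(0)`, on the assumption that they behave as in the v1 example; the helper `span_to_words` and the toy data are hypothetical.

```python
def span_to_words(start_idx, end_idx, word_ids, sequence_ids, words):
    """Map a predicted token span back to the original OCR words.

    Tokens outside the document part (question or special tokens) are skipped.
    """
    picked = []
    for tok in range(start_idx, end_idx + 1):
        if sequence_ids[tok] != 1:  # keep only tokens from the document part
            continue
        w = word_ids[tok]
        # Several sub-word tokens can share one word id; keep each word once.
        if w is not None and (not picked or picked[-1] != w):
            picked.append(w)
    return " ".join(words[w] for w in picked)

# Toy stand-ins for encoding.word_ids(0) and encoding.sequence_ids(0).
# Token layout: <s> what name </s> </s> John Sm ith Paris </s>
words = ["John", "Smith", "Paris"]
word_ids = [None, 0, 1, None, None, 0, 1, 1, 2, None]
sequence_ids = [None, 0, 0, None, None, 1, 1, 1, 1, None]

answer = span_to_words(5, 7, word_ids, sequence_ids, words)
print(answer)
```

One caveat that is not from the replies above: `microsoft/layoutlmv3-base` is a pretrained checkpoint without a fine-tuned question-answering head, so its predicted spans will look arbitrary until the model has been fine-tuned on a document QA dataset.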

Bapt120 · January 23, 2023, 2:54pm

Hi nielsr,

Thanks a lot for your answer, I understand now.

Have a nice day
