LayoutLMv3 Q/A Inference

Hi, I’m a beginner on this platform. For my master’s degree project I have to use the LayoutLM model (more precisely, for question answering on documents).

I have a few questions about running inference with the model for Q/A.

When I read the documentation, I found this for inference with the LayoutLMv1 Q/A model:

from transformers import AutoTokenizer, LayoutLMForQuestionAnswering
from datasets import load_dataset
import torch

tokenizer = AutoTokenizer.from_pretrained("impira/layoutlm-document-qa", add_prefix_space=True)
model = LayoutLMForQuestionAnswering.from_pretrained("impira/layoutlm-document-qa", revision="1e3ebac")

dataset = load_dataset("nielsr/funsd", split="train")
example = dataset[0]
question = "what's his name?"
words = example["words"]
boxes = example["bboxes"]

# Tokenize the question together with the (pre-tokenized) document words
encoding = tokenizer(
    question.split(), words, is_split_into_words=True, return_token_type_ids=True, return_tensors="pt"
)

# Build a bounding box for every token: the word's box for document tokens,
# [1000, 1000, 1000, 1000] for the [SEP] token, and [0, 0, 0, 0] otherwise
bbox = []
for i, s, w in zip(encoding.input_ids[0], encoding.sequence_ids(0), encoding.word_ids(0)):
    if s == 1:  # token belongs to the document words (second sequence)
        bbox.append(boxes[w])
    elif i == tokenizer.sep_token_id:
        bbox.append([1000] * 4)
    else:
        bbox.append([0] * 4)
encoding["bbox"] = torch.tensor([bbox])

word_ids = encoding.word_ids(0)
outputs = model(**encoding)
loss = outputs.loss  # None here, since no labels were passed
start_scores = outputs.start_logits
end_scores = outputs.end_logits
# Map the highest-scoring start/end tokens back to word indices
start, end = word_ids[start_scores.argmax(-1)], word_ids[end_scores.argmax(-1)]
print(" ".join(words[start : end + 1]))

So I can understand how inference works here: the start and end logits determine the beginning and the end of the answer span.

But now I want to use the v3 model, and I found this in the docs:

from transformers import AutoProcessor, AutoModelForQuestionAnswering
from datasets import load_dataset
import torch

# apply_ocr=False because the dataset already provides the words and boxes
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = AutoModelForQuestionAnswering.from_pretrained("microsoft/layoutlmv3-base")

dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
example = dataset[0]
image = example["image"]
question = "what's his name?"
words = example["tokens"]
boxes = example["bboxes"]

# The processor prepares both the image and the text + boxes in one go
encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
start_positions = torch.tensor([1])
end_positions = torch.tensor([3])

outputs = model(**encoding, start_positions=start_positions, end_positions=end_positions)
loss = outputs.loss  # computed only because the position labels were passed
start_scores = outputs.start_logits
end_scores = outputs.end_logits

Firstly, can someone explain to me what the start and end positions are in the model arguments?

And when I try to run the same inference as above, it doesn’t give me the expected results.

If I understood correctly, the logits are (unnormalized) probabilities for the beginning and the end of the answer. We take the argmax to get the token with the highest score, look it up in the word_ids list to get the index of the word, and then use that index on the input words list to get the word? (My attempt is sketched below.)
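In code, what I’m attempting is roughly the v1-style decoding applied to the v3 encoding (a sketch, reusing encoding, model and words from the snippet above, without the label arguments):

word_ids = encoding.word_ids(0)
outputs = model(**encoding)
# Map the highest-scoring start/end tokens back to word indices
# (word_ids is None for special tokens, so this lookup can fail)
start = word_ids[outputs.start_logits.argmax(-1)]
end = word_ids[outputs.end_logits.argmax(-1)]
print(" ".join(words[start : end + 1]))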

If anyone can help me, I will be very grateful. I’ve been thinking about this for days.

Hi,

To answer your first question: start_positions and end_positions are labels, i.e. the ground-truth indices of the tokens where the answer starts and ends. They are only needed when you want the model to compute a loss (during training or fine-tuning). At inference time you simply call model(**encoding) without them, in which case outputs.loss will be None.

The model outputs start_scores and end_scores, which are logits (unnormalized scores) indicating which token is the start, and which token is the end, of the answer. One can normalize them using a softmax to turn them into probabilities.
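For instance, a minimal sketch of that normalization (reusing the outputs from the snippet above):

import torch

# Softmax over the sequence dimension turns the logits into per-token probabilities
start_probs = torch.softmax(outputs.start_logits, dim=-1)
end_probs = torch.softmax(outputs.end_logits, dim=-1)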

However, to turn these scores into actual predictions, we just need to take the position of the highest score (for both start_scores and end_scores), which gives us the indices of the predicted start and end tokens:

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

Next, we can use the processor (actually the tokenizer, which is used behind the scenes) to decode the predicted answer:

predict_answer_tokens = encoding.input_ids[0, answer_start_index : answer_end_index + 1]
print(processor.decode(predict_answer_tokens, skip_special_tokens=True))
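Putting it all together, a minimal inference-only sketch (reusing the processor, model, image, question, words and boxes from your v3 snippet; no start_positions/end_positions are passed, so outputs.loss will be None):

encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")

with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**encoding)

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()
predict_answer_tokens = encoding.input_ids[0, answer_start_index : answer_end_index + 1]
print(processor.decode(predict_answer_tokens, skip_special_tokens=True))

One caveat: microsoft/layoutlmv3-base has not been fine-tuned for question answering, so its QA head is randomly initialized and the predicted spans will be essentially random. To get meaningful answers, use a checkpoint fine-tuned on a document QA dataset.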

Hi nielsr,

Thanks a lot for your answer, I understand it now.

Have a nice day