Hi, I'm a beginner on this platform. For my master's degree project I have to use the LayoutLM model (more precisely, for question answering on documents).
I have a few questions about running inference with the model for Q/A.
When I read the documentation, I found this inference example for the LayoutLMv1 Q/A model:
from transformers import AutoTokenizer, LayoutLMForQuestionAnswering
from datasets import load_dataset
import torch
tokenizer = AutoTokenizer.from_pretrained("impira/layoutlm-document-qa", add_prefix_space=True)
model = LayoutLMForQuestionAnswering.from_pretrained("impira/layoutlm-document-qa", revision="1e3ebac")
dataset = load_dataset("nielsr/funsd", split="train")
example = dataset[0]
question = "what's his name?"
words = example["words"]
boxes = example["bboxes"]
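encoding = tokenizer(
    question.split(), words, is_split_into_words=True, return_token_type_ids=True, return_tensors="pt"
)
bbox = []
for i, s, w in zip(encoding.input_ids[0], encoding.sequence_ids(0), encoding.word_ids(0)):
    if s == 1:
        # token belongs to the document words: use the box of its source word
        bbox.append(boxes[w])
    elif i == tokenizer.sep_token_id:
        bbox.append([1000] * 4)
    else:
        # question tokens and other special tokens get an empty box
        bbox.append([0] * 4)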
encoding["bbox"] = torch.tensor([bbox])
word_ids = encoding.word_ids(0)
outputs = model(**encoding)
loss = outputs.loss
start_scores = outputs.start_logits
end_scores = outputs.end_logits
start, end = word_ids[start_scores.argmax(-1)], word_ids[end_scores.argmax(-1)]
print(" ".join(words[start : end + 1]))
So I can understand how inference works here: the logits are used to determine the beginning and the end of the answer.
But now I want to use the v3 model, and I found this in the docs:
from transformers import AutoProcessor, AutoModelForQuestionAnswering
from datasets import load_dataset
import torch
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = AutoModelForQuestionAnswering.from_pretrained("microsoft/layoutlmv3-base")
dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
example = dataset[0]
image = example["image"]
question = "what's his name?"
words = example["tokens"]
boxes = example["bboxes"]
encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
start_positions = torch.tensor([1])
end_positions = torch.tensor([3])
outputs = model(**encoding, start_positions=start_positions, end_positions=end_positions)
loss = outputs.loss
start_scores = outputs.start_logits
end_scores = outputs.end_logits
Firstly, can someone explain to me what the start and end positions in the model arguments are?
Secondly, when I try to run the same inference as above with this model, it doesn't give me the expected results.
If I understood correctly, the logits give a score (roughly, a probability) for each token being the beginning or the end of the answer. We take the argmax of each, look that index up in the "word_ids" list to get the index of the word, and then index into the input words list to get the word itself?
If anyone can help me, I will be very grateful. I've been thinking about this for days.
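To make my understanding concrete, here is a minimal sketch of that decoding step in plain Python. All the values (`words`, `word_ids`, and both logit lists) are made up for illustration and are not real model output, and `decode_answer` is just a hypothetical helper name. I also assume question and special tokens map to `None` in `word_ids`; as far as I understand, the real tokenizer gives question tokens indices into the question words, so one would also need to check `sequence_ids`.

```python
# Hypothetical values for illustration only; not real model output.
words = ["Name:", "John", "Smith", "Date:", "2020"]

# Maps each token position to an index in `words`.
# Assumption: None marks special tokens and question tokens.
# "John" is split into two tokens here, so word index 1 appears twice.
word_ids = [None, None, None, None, 0, 1, 1, 2, 3, 4, None]

# One made-up score per token position.
start_logits = [0.1, 0.0, 0.2, 0.1, 0.3, 4.2, 0.5, 1.0, 0.2, 0.1, 0.0]
end_logits = [0.0, 0.1, 0.1, 0.2, 0.1, 0.4, 0.9, 3.8, 0.3, 0.2, 0.1]


def argmax(scores):
    """Return the position of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def decode_answer(start_logits, end_logits, word_ids, words):
    """Map the best start/end token positions back to a span of words."""
    start_word = word_ids[argmax(start_logits)]
    end_word = word_ids[argmax(end_logits)]
    # Give up if either position is not a document word or the span is inverted.
    if start_word is None or end_word is None or end_word < start_word:
        return ""
    return " ".join(words[start_word : end_word + 1])


answer = decode_answer(start_logits, end_logits, word_ids, words)
print(answer)  # John Smith
```

Is this the right way to think about it for the v3 model too, with the `word_ids` coming from the processor's encoding?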