I'm trying to use LayoutLMv2 to extract information from invoice pictures.
So far, based on what's described here, I've run the following:
```python
from transformers import LayoutLMv2Processor, LayoutLMv2ForQuestionAnswering
from PIL import Image
import torch

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForQuestionAnswering.from_pretrained("microsoft/layoutlmv2-base-uncased")

# PIL cannot open PDFs directly, so use an image file (png, jpg, ...)
image = Image.open("name_of_your_document.png").convert("RGB")
question = "what's his name?"
encoding = processor(image, question, return_tensors="pt")

# start/end positions are labels; they are only needed to compute a training loss
start_positions = torch.tensor([1])
end_positions = torch.tensor([3])
outputs = model(**encoding, start_positions=start_positions,
                end_positions=end_positions)
loss = outputs.loss
start_scores = outputs.start_logits
end_scores = outputs.end_logits
```
However, I'm not exactly sure how to get from this to the actual words of the answer, rather than just a visual box. There are a few Gradio apps over on Spaces that do something similar, but all of them draw boxes on the image, and I want to extract the answer as text.
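In case it helps to make the question concrete: my understanding is that you take the argmax of `start_logits` and `end_logits` to get a token span, and then decode that span back to text with the processor's tokenizer, e.g. `processor.tokenizer.decode(encoding.input_ids[0][start_idx : end_idx + 1], skip_special_tokens=True)`. A minimal pure-Python sketch of that span-decoding logic, with made-up logits and tokens standing in for the real model outputs:

```python
# Hypothetical stand-ins for outputs.start_logits, outputs.end_logits,
# and the tokens of encoding.input_ids (NOT real model output).
start_logits = [0.1, 2.0, 0.3, 0.2]
end_logits = [0.1, 0.2, 3.0, 0.4]
tokens = ["[CLS]", "john", "doe", "[SEP]"]

# Pick the most likely start and end token positions.
start_idx = max(range(len(start_logits)), key=start_logits.__getitem__)
end_idx = max(range(len(end_logits)), key=end_logits.__getitem__)

# Join the tokens in the predicted span; with a real tokenizer you
# would call processor.tokenizer.decode() on the input_ids slice instead.
answer = " ".join(tokens[start_idx : end_idx + 1])
print(answer)  # prints: john doe
```

Is this the right way to turn the logits into a text answer, or is there a cleaner built-in way?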