I am going through the LayoutLMv2 inference tutorial by @nielsr. It is a great tutorial for understanding how LayoutLM works.
The final output in the tutorial is a set of bounding boxes drawn on top of the invoice image. I would like to get the question-answer pairs as the output instead (as JSON, if possible). My questions are:
- How can I get the Question and Answer text (instead of the bounding boxes)?
- How do I cluster the words that belong to one Question/Answer? (Right now there can be multiple bounding boxes for a single Question/Answer.)
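
To make both questions concrete, here is a rough sketch of the kind of post-processing I have in mind. It is only an illustration under several assumptions: that the predictions have already been reduced to one label per word, that the labels are BIO-style tags such as `B-QUESTION` / `I-QUESTION` / `B-ANSWER` / `I-ANSWER`, and that each word has an `[x0, y0, x1, y1]` box. The function names `group_entities` and `pair_questions_answers` and the sample data are made up by me; they are not part of the tutorial or of the `transformers` library.

```python
import json


def group_entities(words, labels, boxes):
    """Merge consecutive word-level predictions into entities using BIO tags."""
    entities = []
    current = None
    for word, label, box in zip(words, labels, boxes):
        if label == "O":
            # Words outside any entity close the current group.
            current = None
            continue
        prefix, _, kind = label.partition("-")
        # Start a new entity on a B- tag, or when the entity type changes.
        if prefix == "B" or current is None or current["label"] != kind:
            current = {"label": kind, "words": [word], "box": list(box)}
            entities.append(current)
        else:
            current["words"].append(word)
            # Grow the entity box to the union of its word boxes.
            current["box"] = [
                min(current["box"][0], box[0]),
                min(current["box"][1], box[1]),
                max(current["box"][2], box[2]),
                max(current["box"][3], box[3]),
            ]
    return [
        {"label": e["label"], "text": " ".join(e["words"]), "box": e["box"]}
        for e in entities
    ]


def pair_questions_answers(entities):
    """Naively pair each QUESTION with the ANSWER that directly follows it."""
    pairs = []
    for i, entity in enumerate(entities):
        if entity["label"] != "QUESTION":
            continue
        following = entities[i + 1] if i + 1 < len(entities) else None
        has_answer = following is not None and following["label"] == "ANSWER"
        pairs.append({
            "question": entity["text"],
            "answer": following["text"] if has_answer else None,
        })
    return pairs


# Hypothetical word-level predictions for a small invoice snippet.
words = ["Invoice", "No:", "12345", "Date:", "01/02/2021"]
labels = ["B-QUESTION", "I-QUESTION", "B-ANSWER", "B-QUESTION", "B-ANSWER"]
boxes = [
    [10, 10, 60, 20],
    [62, 10, 80, 20],
    [90, 10, 130, 20],
    [10, 30, 40, 40],
    [50, 30, 120, 40],
]

entities = group_entities(words, labels, boxes)
pairs = pair_questions_answers(entities)
print(json.dumps(pairs, indent=2))
```

Pairing by reading order like this is obviously naive and would break when a question has no answer next to it or when the layout is tabular, so I am not sure it is the right approach.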
Any help in this regard would be appreciated.