Get the Q&A in LayoutLMv2 in text form

I am going through the LayoutLMv2 inference tutorial by @nielsr. It is a great tutorial for understanding how LayoutLM works.

The final output of the model is a set of bounding boxes drawn on top of the invoice image. I would like to get the Question & Answer pairs as the output instead (as JSON if possible). My questions are:

  1. How can I get the Question and Answer text (instead of the bounding boxes)?
  2. How do I cluster the words that belong to a single Question/Answer? (Right now there can be multiple bounding boxes for a single Question/Answer.)

Any help in this regard would be appreciated.


We plan to add LayoutLMv2ForRelationExtraction (that allows you to do just that) to the library. See here to follow the progress (it also includes a link to a Colab notebook).
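
In the meantime, here is a minimal sketch of how the word-level predictions could be grouped into text spans. It is plain Python and not part of the library. It assumes you have already mapped the token-level predictions back to one label per word, that the labels follow a BIO scheme such as B-QUESTION / I-QUESTION / B-ANSWER / I-ANSWER / O, and that each word has an [x0, y0, x1, y1] box. The names group_entities and merge_boxes are hypothetical helpers made up for this example.

```python
import json


def merge_boxes(boxes):
    """Return the smallest [x0, y0, x1, y1] box that encloses all given boxes."""
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return [min(xs0), min(ys0), max(xs1), max(ys1)]


def group_entities(words, labels, boxes):
    """Merge consecutive words that share a BIO entity into a single span.

    Assumes `words`, `labels` and `boxes` are aligned, word-level lists.
    """
    spans = []
    current = None
    for word, label, box in zip(words, labels, boxes):
        prefix, _, kind = label.partition("-")
        if not kind:
            # "O" (or any label without a type): close the open span, if any.
            if current is not None:
                spans.append(current)
                current = None
            continue
        starts_new = prefix == "B" or current is None or current["label"] != kind
        if starts_new:
            if current is not None:
                spans.append(current)
            current = {"label": kind, "words": [word], "boxes": [box]}
        else:
            current["words"].append(word)
            current["boxes"].append(box)
    if current is not None:
        spans.append(current)
    return [
        {
            "label": span["label"],
            "text": " ".join(span["words"]),
            "box": merge_boxes(span["boxes"]),
        }
        for span in spans
    ]


# Toy input standing in for the word-level model output (made-up values).
words = ["Invoice", "No:", "12345", "Date:", "2021-01-01"]
labels = ["B-QUESTION", "I-QUESTION", "B-ANSWER", "B-QUESTION", "B-ANSWER"]
boxes = [
    [10, 10, 60, 20],
    [65, 10, 90, 20],
    [100, 10, 150, 20],
    [10, 30, 50, 40],
    [60, 30, 140, 40],
]

entities = group_entities(words, labels, boxes)
print(json.dumps(entities, indent=2))
```

This gives you one text string and one merged box per Question/Answer span, which covers both points above. It does not decide which answer belongs to which question; that linking step is what the relation extraction head is meant to handle.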