Using UDOP for layout analysis

Hello,

I am very interested in the newly added UDOP model. I would like to use it for layout analysis, and from reading the documentation and the paper, it should already be pre-trained for this task using a prompt like “Layout Analysis. Title”. However, when I run inference with this prompt using the example inference notebook, the model just seems to do document classification instead - no layout analysis.

So, I thought it should probably be fine-tuned instead, and I attempted to fine-tune it on the DocLayNet dataset. The layout analysis output is supposed to consist of so-called “layout tokens” (from the paper), which are string representations of the layout bboxes after discretising them to a certain layout vocab size. The paper gives the following example of a layout token sequence: <50><100><250><300>

However, the UDOP processor doesn’t seem to recognize this text format - it tokenizes this into ['<unk>', '50', '>', '<unk>', '100', '>', '<unk>', '250', '>', '<unk>', '300', '>'], so it doesn’t even have a token for <. I think the “layout tokens” should be part of the tokenizer vocabulary for this task to make sense but there are no tokens in the vocabulary that look like layout tokens.
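For context, the discretisation scheme described in the paper can be sketched in plain Python. This is a hypothetical helper, not part of the transformers API; the vocab size of 500 and the <loc_k> token naming are assumptions taken from the linked notebook:

```python
LAYOUT_VOCAB_SIZE = 500  # assumed number of discretisation bins


def bbox_to_layout_tokens(box, image_width, image_height):
    """Discretise a pixel bbox (x1, y1, x2, y2) into <loc_k> layout tokens.

    Hypothetical illustration of the paper's scheme; not a transformers API.
    """
    x1, y1, x2, y2 = box
    # Normalize each coordinate to [0, 1], then map it to a discrete bin
    bins = [
        min(int(x1 / image_width * LAYOUT_VOCAB_SIZE), LAYOUT_VOCAB_SIZE - 1),
        min(int(y1 / image_height * LAYOUT_VOCAB_SIZE), LAYOUT_VOCAB_SIZE - 1),
        min(int(x2 / image_width * LAYOUT_VOCAB_SIZE), LAYOUT_VOCAB_SIZE - 1),
        min(int(y2 / image_height * LAYOUT_VOCAB_SIZE), LAYOUT_VOCAB_SIZE - 1),
    ]
    return "".join(f"<loc_{b}>" for b in bins)


print(bbox_to_layout_tokens((100, 200, 500, 600), image_width=1000, image_height=1000))
# -> <loc_50><loc_100><loc_250><loc_300>
```

With a 1000x1000 image, the pixel box (100, 200, 500, 600) maps to the bins 50/100/250/300, matching the paper's example sequence.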

I found a linked issue in the microsoft codebase (here) but no answer. In that issue, there is a linked notebook where layout tokens do appear in the tokenizer vocabulary (<loc_0>, ..., <loc_499>). However, these tokens are not present in the Hugging Face implementation.

Any suggestions how I can proceed from here? How can I use layout tokens in the Hugging Face implementation and consequently use the model for layout analysis? Any help is greatly appreciated :slight_smile:

Transformers-Tutorials/UDOP/Layout_analysis_with_UDOP.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub :wink:


This is great, thanks for the quick response Niels :smiley:

So there were 2 things wrong in my code actually:

  1. The model “microsoft/udop-large” doesn’t have any layout tokens in the vocabulary, but “microsoft/udop-large-512” does.
  2. I need to set skip_special_tokens=False when generating the output.

Hi,

Thanks for investigating! Indeed, I still need to push the special tokens to the vocabulary of microsoft/udop-large; I can do that soon.

I have a PR open for it: [UDOP] Add special tokens to tokenizer by NielsRogge · Pull Request #29594 · huggingface/transformers · GitHub.

And indeed, one needs to keep the special tokens when decoding (i.e. pass skip_special_tokens=False), as otherwise the layout tokens are not included in the final output.

I also saw rather poor layout analysis performance; this may be because I’m using the Tesseract OCR engine, whereas the authors probably used Azure’s Read API.

I noticed that your visualisation of the output bboxes seems to be a bit off. I used the following code instead to unnormalize the output bboxes:

LAYOUT_VOCAB_SIZE = 500  # number of discretisation bins used for the layout tokens

def unnormalize_box(box, image_width, image_height):
    # Map discretised coordinates (0..LAYOUT_VOCAB_SIZE) back to pixel space
    x1 = box[0] / LAYOUT_VOCAB_SIZE * image_width
    y1 = box[1] / LAYOUT_VOCAB_SIZE * image_height
    x2 = box[2] / LAYOUT_VOCAB_SIZE * image_width
    y2 = box[3] / LAYOUT_VOCAB_SIZE * image_height
    return [x1, y1, x2, y2]

and then the layout analysis results look better :slight_smile:
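To apply this, the decoded generation first has to be parsed into groups of four coordinates. A minimal sketch, assuming the output interleaves <loc_k> tokens as in the 512 checkpoint (the helper name is mine, not a transformers API):

```python
import re

LAYOUT_VOCAB_SIZE = 500  # assumed number of discretisation bins


def parse_layout_tokens(decoded, image_width, image_height):
    """Extract <loc_k> indices from decoded model output and unnormalize
    them to pixel bboxes, taking consecutive groups of four as x1, y1, x2, y2."""
    indices = [int(m) for m in re.findall(r"<loc_(\d+)>", decoded)]
    boxes = []
    for i in range(0, len(indices) - 3, 4):
        x1, y1, x2, y2 = indices[i : i + 4]
        boxes.append([
            x1 / LAYOUT_VOCAB_SIZE * image_width,
            y1 / LAYOUT_VOCAB_SIZE * image_height,
            x2 / LAYOUT_VOCAB_SIZE * image_width,
            y2 / LAYOUT_VOCAB_SIZE * image_height,
        ])
    return boxes


decoded = "<loc_50><loc_100><loc_250><loc_300></s>"
print(parse_layout_tokens(decoded, image_width=1000, image_height=1000))
# -> [[100.0, 200.0, 500.0, 600.0]]
```

The regex also makes the parsing robust to other special tokens (like </s>) being present in the decoded string, which is why keeping skip_special_tokens=False is safe here.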

But you’re probably right that a better OCR engine would improve the results :+1:

Thanks again for looking into this!

Oh thanks, I still had to look into the paper regarding what these layout tokens represent. Will update my notebook!

Also, I’ve pushed the special tokens for microsoft/udop-large, so it should work: Upload processor · microsoft/udop-large at 95cde2f.


Another thing I just learned: using the prompt “Layout analysis on PubLayNet. Text” seems to work better than just “Layout analysis. Text”
