I am very interested in the newly added UDOP model. I would like to use it for layout analysis, and from reading the documentation + paper, I found that it should already be pre-trained for this task using a prompt like this one: “Layout Analysis. Title”. However, when I try to run inference with this prompt using the example inference notebook, the model just seems to do document classification instead, with no layout analysis.
So, I thought it should probably be fine-tuned instead, and I attempted to fine-tune it on the DocLayNet dataset. The layout analysis output is supposed to consist of so-called “layout tokens” (from the paper), which are a string representation of the layout bounding boxes after discretising their coordinates to a fixed layout vocab size. In the paper, they mention the following as an example of a layout token: <50><100><250><300>
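To make the target format concrete, here is my reading of that discretisation step. This is purely my own sketch: the normalisation by page size and the vocab size of 500 are assumptions based on the paper's description, not a confirmed API.

```python
# Sketch of how I understand the paper's layout tokens: normalise each bbox
# coordinate to [0, vocab_size) and render it as <index>. The vocab size of
# 500 and the normalisation by page size are my assumptions.
def bbox_to_layout_tokens(bbox, page_width, page_height, vocab_size=500):
    x0, y0, x1, y1 = bbox
    scaled = [x0 / page_width, y0 / page_height, x1 / page_width, y1 / page_height]
    return "".join(f"<{min(int(v * vocab_size), vocab_size - 1)}>" for v in scaled)

print(bbox_to_layout_tokens((100, 200, 500, 600), 1000, 1000))
# -> <50><100><250><300>, matching the example from the paper
```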
However, the UDOP processor doesn’t seem to recognize this text format: it tokenizes it into ['<unk>', '50', '>', '<unk>', '100', '>', '<unk>', '250', '>', '<unk>', '300', '>'], so it doesn’t even have a token for <. I think the “layout tokens” should be part of the tokenizer vocabulary for this task to make sense, but there are no tokens in the vocabulary that look like layout tokens.
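For reference, this is how I checked (assuming the microsoft/udop-large checkpoint; adjust if you are using a different one):

```python
from transformers import UdopTokenizer

tokenizer = UdopTokenizer.from_pretrained("microsoft/udop-large")
print(tokenizer.tokenize("<50><100><250><300>"))
# -> ['<unk>', '50', '>', '<unk>', '100', '>', '<unk>', '250', '>', '<unk>', '300', '>']
```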
I found a linked issue in the microsoft codebase (here) but no answer. In this issue, there is a linked notebook where there do appear to be layout tokens in the tokenizer vocabulary (<loc_0>, ..., <loc_499>). However, these tokens are not present in the Hugging Face implementation.
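One workaround I considered was registering those tokens myself and resizing the model’s embeddings. This is just a sketch of that idea (the newly added embeddings would be randomly initialised, so fine-tuning would still be needed), and I’m not sure it matches what the authors did:

```python
from transformers import UdopForConditionalGeneration, UdopTokenizer

tokenizer = UdopTokenizer.from_pretrained("microsoft/udop-large")
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

# Register <loc_0> ... <loc_499> as special tokens, mirroring the
# vocabulary seen in the Microsoft notebook (assumption on my part).
loc_tokens = [f"<loc_{i}>" for i in range(500)]
tokenizer.add_tokens(loc_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
```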
Any suggestions on how I can proceed from here? How can I use layout tokens in the Hugging Face implementation and consequently use the model for layout analysis? Any help is greatly appreciated.
And indeed, one needs to keep the special tokens when decoding (i.e. not skip them), as otherwise the layout tokens are not included in the final output.
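Concretely, something like this (a sketch; the `processor`, `model`, and `encoding` names are assumed from the standard UDOP inference example):

```python
generated_ids = model.generate(**encoding, max_new_tokens=128)
# skip_special_tokens=False keeps the <loc_...> layout tokens in the output
decoded = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(decoded)
```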
I also saw rather poor layout analysis performance; this may be due to the fact that I’m using the Tesseract OCR engine, whereas the authors probably used Azure’s Read API.