Models for Document Image Annotation Without OCR

Hi everyone!

I’m currently researching models for working with documents based on both visual and textual information. I want to train a model capable of annotating text document images with classes and providing coordinates. What I mean is: the model should highlight areas on an image that belong to specific topics.

For example, on promotional images, it should detect areas (labels) like “contact information,” “product details,” etc., and also provide coordinates (bboxes) for each label.

I have a dataset consisting of (image, label, bbox) triples. A bbox does not correspond to a single word; instead, each bbox belongs to one label and covers an entire region of text.

The task is to input an image during inference and get a label + bbox as output.

I’ve already tried the Florence2 model and am currently working with LayoutLMv2. However, the issue with LayoutLMv2 is that it requires OCR of the text within the bbox during training and also needs both the image and OCR text as input during inference. This approach doesn’t suit me, as OCR might perform poorly in real-world tasks, which could degrade the model’s quality.

Does anyone know of models capable of handling similar tasks without relying on OCR and that can process textual information directly from images?

Thank you so much for your time and help!

Best,

Paul


Have a look at layout-parser, specifically for document segmentation without OCR, and at Meta's detectron2, which layout-parser uses for a lot of its segmentation tasks. Detectron2 can be, and has been, fine-tuned for exactly this kind of task (for example in layout-parser), so with some fine-tuning on your data you could get a pretty good model, I think.
It does pretty much what you need, i.e. it takes an image as input and returns labels for different categories, each with a bounding box and some meta information.
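
If it helps, here is a minimal sketch of the inference side with layout-parser and one of its pre-trained detectron2 models (the PubLayNet model from the layout-parser model zoo). The image path is hypothetical, and the label_map is just that model's default classes, not your custom "contact information" / "product details" labels:

```python
# Minimal sketch: detect layout regions (label + bbox) from an image, no OCR.
# Assumes the pre-trained PubLayNet model from the layout-parser model zoo.
import layoutparser as lp
import cv2

image = cv2.imread("promo.jpg")   # hypothetical input image
image = image[..., ::-1]          # BGR -> RGB for layout-parser

model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    # Default PubLayNet classes; a fine-tuned model would use your own labels.
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

layout = model.detect(image)      # returns a Layout of detected blocks
for block in layout:
    # Each block carries a class label, a confidence score,
    # and bbox coordinates -- purely visual, no OCR involved.
    x1, y1, x2, y2 = block.coordinates
    print(block.type, block.score, (x1, y1, x2, y2))
```

To get your own labels instead of the PubLayNet ones, you would fine-tune a detectron2 model on your (image, label, bbox) dataset (e.g. converted to COCO-style annotations) and then point `Detectron2LayoutModel` at your own config, weights, and label_map.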
