Hi everyone!
I’m currently researching models that work with documents using both visual and textual information. I want to train a model that can annotate images of text documents with classes and provide coordinates. In other words, the model should highlight the areas of an image that belong to specific topics.
For example, on promotional images, it should detect areas (labels) like “contact information,” “product details,” etc., and also provide coordinates (bboxes) for each label.
I have a dataset of (image, label, bbox) samples. A bbox does not correspond to a single word; instead, each bbox belongs to one label and covers an entire region of text.
The task: at inference time, input an image and get labels + bboxes as output.
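To make the data format concrete, here is roughly what one training sample and the desired output look like (the file name, field names, and coordinates below are made up for illustration, not my actual schema):

```python
# One training sample: an image plus region-level annotations.
# Field names and values are illustrative only.
sample = {
    "image": "promo_0001.png",
    "annotations": [
        # bbox format: [x_min, y_min, x_max, y_max] in pixels
        {"label": "contact information", "bbox": [40, 860, 480, 940]},
        {"label": "product details",     "bbox": [60, 200, 700, 620]},
    ],
}

# Desired inference behavior: image in, list of (label, bbox) out, e.g.
# predictions = model.predict("promo_0001.png")
# -> [("contact information", [38, 855, 484, 942]),
#     ("product details",     [58, 195, 705, 618])]
```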
I’ve already tried the Florence-2 model and am currently working with LayoutLMv2. The issue with LayoutLMv2 is that it requires OCR text for each bbox during training, and at inference it needs both the image and the OCR text as input. This approach doesn’t suit me, since OCR can perform poorly on real-world documents, which would degrade the model’s quality.
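For context, this is the kind of zero-shot inference Florence-2 supports out of the box with its task prompts. The sketch below is based on the usage shown on the Hugging Face model card (the generic "<OD>" task token and the post-processing call are from that example, not necessarily exactly what I ran):

```python
# Rough sketch of Florence-2 inference via the Hugging Face checkpoint.
# Details may differ depending on your transformers version.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("promo_0001.png").convert("RGB")
prompt = "<OD>"  # generic object-detection task token

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
    )
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Converts the generated string into {"<OD>": {"bboxes": [...], "labels": [...]}}
result = processor.post_process_generation(
    generated_text, task=prompt, image_size=(image.width, image.height)
)
print(result)
```

What I actually need is this same image-only interface, but producing my own region classes (“contact information”, “product details”, etc.) after fine-tuning on my dataset.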
Does anyone know of models that can handle this kind of task without relying on OCR, i.e. that process the textual information directly from the image?
Thank you so much for your time and help!
Best,
Paul