Hi everyone!
I’m currently researching models that work with documents using both visual and textual information. I want to train a model that can annotate images of text documents with classes and provide coordinates. In other words, the model should highlight the areas of an image that belong to specific topics.
For example, on promotional images, it should detect areas (labels) like “contact information,” “product details,” etc., and also provide coordinates (bboxes) for each label.
I have a dataset of (image, label, bbox) samples. A bbox does not correspond to a single word; instead, each bbox belongs to one label and covers an entire region of text.
The task: at inference time, input an image and get labels + bboxes as output.
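To make the data format concrete, here is roughly what one training sample and the desired output look like (the file name, field names, and coordinates below are made up for illustration, not my actual schema):

```python
# One training sample: an image plus region-level annotations.
# Field names and values are illustrative only.
sample = {
    "image": "promo_0001.png",
    "annotations": [
        # bbox format: [x_min, y_min, x_max, y_max] in pixels
        {"label": "contact information", "bbox": [40, 860, 480, 940]},
        {"label": "product details",     "bbox": [60, 200, 700, 620]},
    ],
}

# Desired inference behavior: image in, list of (label, bbox) out, e.g.
# predictions = model.predict("promo_0001.png")
# -> [("contact information", [38, 855, 484, 942]),
#     ("product details",     [58, 195, 705, 618])]
```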
I’ve already tried the Florence-2 model and am currently working with LayoutLMv2. The issue with LayoutLMv2 is that it requires OCR text for each bbox during training, and at inference it needs both the image and the OCR text as input. This approach doesn’t suit me, since OCR can perform poorly on real-world documents, which would degrade the model’s quality.
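For context, this is the kind of zero-shot inference Florence-2 supports out of the box with its task prompts. The sketch below is based on the usage shown on the Hugging Face model card (the generic "<OD>" task token and the post-processing call are from that example, not necessarily exactly what I ran):

```python
# Rough sketch of Florence-2 inference via the Hugging Face checkpoint.
# Details may differ depending on your transformers version.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("promo_0001.png").convert("RGB")
prompt = "<OD>"  # generic object-detection task token

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
    )
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Converts the generated string into {"<OD>": {"bboxes": [...], "labels": [...]}}
result = processor.post_process_generation(
    generated_text, task=prompt, image_size=(image.width, image.height)
)
print(result)
```

What I actually need is this same image-only interface, but producing my own region classes (“contact information”, “product details”, etc.) after fine-tuning on my dataset.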
Does anyone know of models that can handle this kind of task without relying on OCR, i.e. that process the textual information directly from the image?
Thank you so much for your time and help!
Best,
Paul