Multi-page Document Classification

anon86149130 · June 15, 2023, 10:34am

Sorry for the long post, but I needed the space to capture all the details and questions.

I am working on a multi-page document image classification problem and am confused about which approach or model architecture to follow. Here is the problem statement:

The problem is to classify a document-set into one of N classes. A document-set consists of one or more scanned document images, which makes this a multi-page document image classification problem. The number of pages in a document-set varies from 1 to 50. The classification follows a set of business rules that can be complex and difficult to extract and identify from the images. That is why, instead of extracting features for each business rule separately, I am trying to train a model that can learn these rules on its own and just give me the final class label.

I want to use both image and text features of the documents to make a robust model. Here’s a rudimentary approach I have in mind so far:

1. Use an image transformer model (e.g. DiT) to extract document image embeddings for a single page.
2. Use a language transformer model (e.g. BERT) to extract document text embeddings for a single page.
3. Concatenate the two embeddings to get a final embedding vector for a single page.
4. ALTERNATIVE to STEPS 1, 2, 3: should I instead use a single model that captures both image and text features to generate the page embeddings, such as LayoutLMv3 or Donut?
5. Do this for all the pages in that document-set.
6. Concatenate the final embedding vectors from all the pages in the document-set and pass them through a fully connected layer that predicts the final class label.
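
To make the idea concrete, here is a minimal PyTorch sketch of the concatenation steps above, not a tested solution. It assumes the per-page embeddings have already been computed; random tensors stand in for the DiT and BERT outputs, and the embedding size of 768, the class count of 5, and the name `ConcatPagesClassifier` are all placeholders I made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_PAGES = 50    # upper bound on pages per document-set (from the problem statement)
IMG_DIM = 768     # assumed image embedding size
TXT_DIM = 768     # assumed text embedding size
NUM_CLASSES = 5   # placeholder for N


class ConcatPagesClassifier(nn.Module):
    """Hypothetical head: concatenate all page vectors, then one linear layer."""

    def __init__(self, img_dim=IMG_DIM, txt_dim=TXT_DIM,
                 max_pages=MAX_PAGES, num_classes=NUM_CLASSES):
        super().__init__()
        self.max_pages = max_pages
        self.page_dim = img_dim + txt_dim
        self.fc = nn.Linear(max_pages * self.page_dim, num_classes)

    def forward(self, img_emb, txt_emb):
        # img_emb: (num_pages, img_dim), txt_emb: (num_pages, txt_dim)
        # for a single document-set.
        num_pages = img_emb.shape[0]
        if num_pages > self.max_pages:
            raise ValueError("document-set has more pages than max_pages")
        # Per-page fusion: image and text embeddings side by side.
        pages = torch.cat([img_emb, txt_emb], dim=-1)  # (num_pages, page_dim)
        # Zero-pad missing pages so every document-set has max_pages rows.
        pages = F.pad(pages, (0, 0, 0, self.max_pages - num_pages))
        # Concatenate all pages into one long vector and classify.
        return self.fc(pages.flatten())  # (num_classes,)


model = ConcatPagesClassifier()
# A 3-page document-set with random stand-in embeddings.
logits = model(torch.randn(3, IMG_DIM), torch.randn(3, TXT_DIM))
print(logits.shape)
```

Written this way, the fully connected layer has `MAX_PAGES * (IMG_DIM + TXT_DIM)` inputs, and each page position gets its own weights, which is part of why I am unsure about plain concatenation.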

Here are some of the concerns I have:

First of all, I am not sure whether the above architecture will work. For example, does it make sense to simply concatenate the embeddings to combine the features of multiple pages? If not, how can I do that effectively?
Since the number of pages in a document-set is variable, how do I capture that in the architecture? I was thinking of adding blank images and blank text (a sort of padding) to bring every document-set to the same length. Would that work?
I am also not sure which components should be trainable. For example, should I fine-tune the base image and language transformers, or use them as frozen pre-trained models to generate the image and text embeddings and train only the layers that combine their features? Ideas?
If I choose to train the base models as well, what framework should I use? Do I need to write the entire architecture in PyTorch, or can I use something like HuggingFace to connect these components together? Any sample code would be helpful here, as I have not written custom architectures before.
Another, relatively small concern: all the images have slightly different dimensions. Is padding the way to bring them all to one common size as well?
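
For the variable page count and the trainable-versus-frozen question, one common pattern (swapped in here as an alternative to padding with blank pages, not something I have validated on this data) is to pad page embeddings only up to the longest document-set in a batch, carry a boolean mask, and average over the real pages. The sketch below assumes page embeddings are precomputed; `MaskedMeanClassifier`, `collate`, and `freeze` are hypothetical helper names, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence


class MaskedMeanClassifier(nn.Module):
    """Hypothetical head: average real pages only, then one linear layer."""

    def __init__(self, page_dim, num_classes):
        super().__init__()
        self.head = nn.Linear(page_dim, num_classes)

    def forward(self, pages, mask):
        # pages: (batch, max_pages_in_batch, page_dim)
        # mask:  (batch, max_pages_in_batch), True where the page is real
        weights = mask.unsqueeze(-1).to(pages.dtype)
        # Padded rows are multiplied by 0, so they never reach the average.
        pooled = (pages * weights).sum(dim=1) / weights.sum(dim=1).clamp(min=1.0)
        return self.head(pooled)  # (batch, num_classes)


def collate(doc_sets):
    """Pad a list of (num_pages, page_dim) tensors to the longest in the batch."""
    lengths = torch.tensor([d.shape[0] for d in doc_sets])
    pages = pad_sequence(doc_sets, batch_first=True)  # zero padding
    mask = torch.arange(pages.shape[1])[None, :] < lengths[:, None]
    return pages, mask


def freeze(module):
    """Stop gradient updates for a module, e.g. a pre-trained base encoder."""
    for param in module.parameters():
        param.requires_grad = False
    return module.eval()


# Two document-sets with 2 and 5 pages; random stand-in page embeddings.
batch = [torch.randn(2, 16), torch.randn(5, 16)]
pages, mask = collate(batch)
model = MaskedMeanClassifier(page_dim=16, num_classes=4)
logits = model(pages, mask)
print(logits.shape)
```

Because the mask removes padded rows from the average, a document-set should get the same logits whether it is batched alone or alongside longer ones. Calling `freeze` on a base encoder keeps it fixed, so only the head on top is trained.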

To add more context, this is an insurance claim classification problem. Just as a claim can contain multiple pages (forms, medical records, letters, bills, etc.), so can a document-set. A claim is what I’ve referred to as a document-set in the question.

jxue005 · October 31, 2023, 9:42pm

I have a similar task. May I ask how you handled this problem in the end? Any feedback is appreciated. Thank you!

sarwarmursalincuet · March 4, 2024, 8:23am

I was also exploring LayoutLMv3 for document classification by training it on my own dataset. Training and evaluation went well, and my model is working well.

But my dataset consists of images from PDFs, where each PDF contains multiple document types (commercial invoice, bill of lading, packing list, etc.) and each type can span multiple pages. I can classify the document types, but what I still need is to identify which page within each type is the first, second, and so on. Have you found any leads on handling multiple pages when classifying documents from images/PDFs?

pkappus · March 22, 2024, 9:18am

Hey, I was reading through your post because I’m facing a similar problem (I don’t know the number of pages beforehand, though). Would you be able to report how you managed to do it and how it’s working?

hiraltalsaniya · August 5, 2025, 4:50pm

I am also working on a similar problem statement.
Have you found any solution to this problem?
