Sorry for the long post, but I needed the space to capture all the details and questions.
I am working on a multi-page document image classification problem and am unsure which approach or model architecture to follow. Here is the problem statement:
The problem is to classify a document-set into one of N classes. A document-set consists of one or more scanned document images, which makes this a multi-page document image classification problem. The number of pages in a document-set may vary from 1 to 50. The classification follows a set of business rules that can be complex and difficult to extract and identify from the images. So instead of extracting features for each business rule separately, I am trying to train a model that can learn these rules on its own and just give me the final class label.
I want to use both image and text features of the documents to make a robust model. Here’s a rudimentary approach I have in mind so far:
- Use an image transformer model (e.g. DiT) to extract document image embeddings for a single page
- Use a language transformer model (e.g. BERT) to extract document text embeddings for a single page
- Concatenate the two embeddings to get a final embedding vector for a single page
- ALTERNATIVE to steps 1-3: should I instead use a single model that captures both image and text features, such as LayoutLMv3 or Donut, to generate the page embeddings?
- Do this for all the pages in that document-set
- Concatenate the final embedding vectors from all the pages in the document-set and pass them through a fully connected layer that predicts the final class label.
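To make the steps above concrete, here is a minimal PyTorch sketch of what I have in mind. It is only a sketch under assumptions: random tensors stand in for the DiT and BERT outputs, the 768-dimensional embedding sizes and `NUM_CLASSES = 5` are placeholders, the names `embed_page` and `ConcatClassifier` are hypothetical, and zero-padding up to 50 pages is just one way to give the fully connected layer a fixed input size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_PAGES = 50     # upper bound on pages per document-set
IMG_DIM = 768      # assumed image embedding size (placeholder)
TXT_DIM = 768      # assumed text embedding size (placeholder)
NUM_CLASSES = 5    # placeholder for N


def embed_page(image_emb, text_emb):
    """Concatenate one page's image and text embeddings (hypothetical helper)."""
    return torch.cat([image_emb, text_emb], dim=-1)


class ConcatClassifier(nn.Module):
    """Hypothetical head: flatten all page embeddings, then one linear layer."""

    def __init__(self, max_pages, page_dim, num_classes):
        super().__init__()
        self.max_pages = max_pages
        self.fc = nn.Linear(max_pages * page_dim, num_classes)

    def forward(self, pages):
        # pages: (num_pages, page_dim) for a single document-set
        missing = self.max_pages - pages.shape[0]
        # Zero-pad along the page axis so the flattened input has a fixed size.
        padded = F.pad(pages, (0, 0, 0, missing))
        return self.fc(padded.flatten())


# Random tensors stand in for the real per-page image and text embeddings.
num_pages = 3
pages = torch.stack([
    embed_page(torch.randn(IMG_DIM), torch.randn(TXT_DIM))
    for _ in range(num_pages)
])
model = ConcatClassifier(MAX_PAGES, IMG_DIM + TXT_DIM, NUM_CLASSES)
logits = model(pages)                     # shape: (NUM_CLASSES,)
predicted_class = logits.argmax().item()  # index of the highest-scoring class
```

One thing that worries me about this version is that the linear layer ties each block of weights to a fixed page position, so the same page appearing at a different position in the document-set is treated as a different input.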
Here are some of the concerns I have:
- First of all, I am not sure whether the above architecture will work. For example, does it make sense to simply concatenate the embeddings to combine the features of multiple pages? If not, how can I do that effectively?
- Since the number of pages in a document-set is variable, how do I handle that in the architecture? I was thinking of adding blank images and blank text (a sort of padding) to bring every document-set to the same length. Would that work?
- I am also not sure which components should be trainable. For example, should I fine-tune the base image and language transformers, or keep them frozen, use them only to generate the image and text embeddings, and train just the layers that combine their features? Ideas?
- If I choose to train the base models as well, what framework should I use? Do I need to write the entire architecture in PyTorch, or can I use something like Hugging Face to connect these components together? Any sample code would be helpful here, as I have not written custom architectures before.
- Another, relatively small concern: all the images have slightly different dimensions. Is padding also the way to bring them all to one common size?
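For the variable page count, this is roughly how I picture the padding idea in PyTorch. It is a sketch under assumptions: the page embeddings are already computed, the padding is applied to the embeddings (as zeros) rather than as blank images and blank text, and the mask plus mean pooling are my own additions so that padded pages do not influence the result. The helper names `pad_document_sets` and `masked_mean` are hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_document_sets(doc_sets):
    """Pad a list of (num_pages_i, dim) tensors to a common page count.

    Returns the padded batch (batch, max_pages, dim) and a boolean mask
    (batch, max_pages) that is True for real pages and False for padding.
    """
    lengths = torch.tensor([d.shape[0] for d in doc_sets])
    padded = pad_sequence(doc_sets, batch_first=True)  # zero-padded
    mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]
    return padded, mask


def masked_mean(padded, mask):
    """Average page embeddings per document-set, ignoring padded pages."""
    mask_f = mask.unsqueeze(-1).to(padded.dtype)  # (batch, max_pages, 1)
    summed = (padded * mask_f).sum(dim=1)         # (batch, dim)
    counts = mask_f.sum(dim=1).clamp(min=1.0)     # (batch, 1)
    return summed / counts


# Two toy document-sets with 2 and 5 pages and 4-dimensional page embeddings.
doc_sets = [torch.ones(2, 4), 3 * torch.ones(5, 4)]
padded, mask = pad_document_sets(doc_sets)
pooled = masked_mean(padded, mask)  # one fixed-size vector per document-set
```

The pooled vector has the same size no matter how many pages a document-set has, so it could be fed to a classification layer directly. Whether mean pooling keeps enough information for the business rules is exactly the kind of thing I am unsure about.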
To add more context, this is an insurance claim classification problem. A claim, which is what I have referred to as a document-set in the question, can contain multiple pages (forms, medical records, letters, bills, etc.).