What would be the most suitable AI tool for automating document classification and extracting relevant data for search functionality?
I have a collection of domain-specific documents, including medical certificates, award certificates, and other certificates and handwritten forms. Some of these documents contain a mix of printed and handwritten text, while others are entirely printed. My goal is to build a system that can automatically classify these documents, extract key information (e.g., names and other relevant details), and enable users to search for a person’s name to retrieve all associated documents stored in the system.
Since I have a dataset of these documents, I can use it to train or fine-tune a model for improved accuracy in text extraction and classification. I am considering OCR-based solutions like Google Document AI and TrOCR, as well as transformer models and vision-language models (VLMs) such as Qwen2-VL, MiniCPM, and GPT-4V. Given my dataset and requirements, which AI tool or combination of tools would be the most effective for this use case?
If the aim is classification rather than detailed content analysis, then you probably don't need every single word recognized perfectly, so I think Qwen is fine. It's good in terms of overall performance, and it particularly excels at multilingual text. If you want something smaller, there is also SmolVLM2, but it may be too small.
I think it would be good to extract the text using a script (there are libraries that extract text from PDFs, etc.) or a VLM, and then classify the text with BERT or an LLM.
If you can accomplish this with BERT or a derivative model, it will be the most cost-effective option. An LLM naturally has high classification performance, but it's also large…
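For example, here is a minimal sketch of that pipeline, assuming the PDFs have a text layer (scanned documents would need OCR or a VLM first). The pypdf library, the zero-shot model, and the candidate labels are just placeholders for whatever you end up using:

```python
# Minimal sketch: extract text from a PDF, then classify it with a
# zero-shot transformer pipeline. pypdf, the model name, and the labels
# are placeholder choices, not recommendations.
from pypdf import PdfReader
from transformers import pipeline

def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page of a PDF that has a text layer."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Zero-shot classification avoids training anything; a fine-tuned BERT
# would be cheaper to run at scale once you have labeled examples.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

labels = ["medical certificate", "award certificate", "handwritten form"]
text = extract_text("sample_document.pdf")
result = classifier(text[:2000], candidate_labels=labels)  # truncate long documents
print(result["labels"][0], result["scores"][0])
```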
I only need to classify around five types of documents, and the document type is already explicitly stated at the top, so classification isn’t the main challenge. The key requirement is extracting both printed and handwritten fields—specifically first names, last names, and other relevant details. Each person has all five document types, and their names are already present in each.
Given this, would it make sense to use a lightweight OCR system (e.g., TrOCR or a fine-tuned Tesseract) for text extraction, followed by an NER model (a BERT-based approach or LayoutLM) for structured data extraction? Would a smaller VLM still provide any advantages here, or would it be unnecessary overhead?
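For context, the extraction step I have in mind would look roughly like this (a rough sketch using the public TrOCR handwritten checkpoint; the image path is a placeholder, and TrOCR expects a crop of a single text line or field rather than a full page):

```python
# Rough sketch of the OCR step using the public TrOCR handwritten checkpoint.
# Note: TrOCR expects a cropped image of a single text line/field, not a
# whole scanned page, so fields would need to be localized first.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# Placeholder path: a crop of the handwritten "name" field from a scanned form.
field_crop = Image.open("name_field_crop.png").convert("RGB")

pixel_values = processor(images=field_crop, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)  # the recognized field text, to be passed on to a NER/LayoutLM step
```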
Text extraction is a very difficult problem to solve for several reasons. If you will be implementing your own (low-budget) solution based on OCR text, there are several tasks to consider first:
Preparing text data:
Clean up junk characters (no newlines needed, no structure in the text needed)
Standardize the input text to a predefined language locale (dates, numbers, currency, …)
Tokenization:
Standard or pretrained tokenizers share the same problem: they might not contain or represent all the data that will appear in the input and output. That's fine for classification but devastating for extraction. For example, currencies, dates, and numbers in general are difficult to extract deterministically without hallucinations or pre-learned tokens.
I recommend looking into character-based tokenization, maybe using ANSI or another basic charset and adding more characters if required. You can then use character positions to train the model (a minimal sketch follows after this list).
Cost function:
Depending on the task at hand, you can decide which cost function to use.
Text feature position extraction (bidirectional LSTM model, position-based 0/1 labels):
nn.BCELoss()
Text classification (transformer model):
nn.CrossEntropyLoss()
You might need more than one model to achieve the desired result, but it can be done. For example, if you train an LSTM to look for a single class of data it will work great; otherwise, not so much.
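To make that concrete, here is a minimal sketch of the character-level setup I mean: a bidirectional LSTM that outputs a 0/1 score per character position, trained with nn.BCELoss(). The charset, layer sizes, and the toy training step are placeholders:

```python
# Minimal sketch: character-level tokenization + bidirectional LSTM that
# predicts, per character position, whether it belongs to the target field
# (0/1), trained with BCELoss. All sizes and the example are placeholders.
import torch
import torch.nn as nn

# Character-based "tokenizer": a basic charset, extended as needed.
CHARSET = list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,:-/")
PAD = 0
char2id = {c: i + 1 for i, c in enumerate(CHARSET)}  # 0 is reserved for padding

def encode(text: str, max_len: int = 256) -> torch.Tensor:
    ids = [char2id.get(c, PAD) for c in text[:max_len]]  # unknown chars -> PAD
    ids += [PAD] * (max_len - len(ids))
    return torch.tensor(ids)

class CharPositionTagger(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=PAD)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return torch.sigmoid(self.head(h)).squeeze(-1)  # (batch, seq_len) in [0, 1]

model = CharPositionTagger(vocab_size=len(CHARSET) + 1)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy training step: the label marks which characters belong to the name.
text = "Name: John Smith  Date: 2024-01-01"
x = encode(text).unsqueeze(0)
y = torch.zeros(1, 256)
y[0, 6:16] = 1.0  # character positions of "John Smith"

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```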
As InXis says, it’s extremely difficult to ensure accuracy. On the other hand, if you just want to summarize the gist of the text, you can get away with just using a good VLM (give it a command like “summarize the text in the image”).
If the content is important, it's hard to do with a model the size of SmolVLM, so why not try the latest Qwen2.5-VL, which was released just the other day? There are several size variations.
It was quite powerful at version 2, so it might just work.
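For reference, the usage pattern from the Qwen2.5-VL model card looks roughly like this (it needs a recent transformers plus the qwen-vl-utils package; the checkpoint size, image path, and prompt are placeholders):

```python
# Rough sketch of prompting Qwen2.5-VL with a scanned document image.
# Checkpoint size, image path, and prompt are placeholders.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "scanned_certificate.png"},
            {"type": "text", "text": "Summarize the text in the image."},
        ],
    }
]

# Build the chat prompt and the vision inputs, then generate.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```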
Thanks for the detailed explanation! I really appreciate the breakdown.
From what I understand:
For classification, a transformer model (like BERT) with standard tokenization should work well.
For text extraction, I should consider using character-based tokenization to handle names, dates, and numbers more reliably.
Depending on the task, I might need different models (e.g., a bidirectional LSTM with BCELoss for position extraction and a transformer with CrossEntropyLoss for classification).
A few clarifications about my dataset:
Three of the five document types follow the same format, while the other two each have a different layout, so handling structural differences is a consideration. Also, these documents will be scanned.
Only one document type includes both printed and handwritten text, where a person would manually fill in fields like their name.
Would you recommend using an OCR-specific model like TrOCR, or should I just apply a generic OCR system (e.g., Tesseract or Google Document AI) and process the extracted text separately?
If I were to train a bidirectional LSTM for position extraction, how much labeled data would typically be needed to get reliable performance, especially considering the mix of structured and semi-structured documents?
Thanks for the suggestion! Qwen 2.5-VL does seem like a strong option, especially with its improved text recognition, structured output capabilities, and document parsing using QwenVL HTML.
Would you recommend using an OCR-specific model like TrOCR,
Although it’s a little different from OCR, there may be a way to have scanned PDFs read by a model like this. Structured document analysis is a field with a lot of demand, so even just searching this forum for PDFs will show you quite a few attempts. And there’s still no definitive answer…
If I were to train a bidirectional LSTM for position extraction, how much labeled data would typically be needed to get reliable performance, especially considering the mix of structured and semi-structured documents?
This is not limited to this case, but I strongly recommend asking questions about model training on the HF Discord. Although it's not always the same people (because it's a Discord for mutual support between users, including HF staff, not a user-support channel), there are usually people training LLMs and VLMs on a daily basis, so you can get practical and useful information.
I recommend asking questions during the day in US time.
Wanted to mention: if your documents contain forms and/or tables, it might be helpful to first do table recognition and then structure recognition, so you can pull apart and handle individual cells instead of processing the whole document at once. That way, if OCR sees "FIRST NAME" in a table cell, you know the rest of the text in that cell is the first-name value.
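For example, one way to sketch that with the public Table Transformer checkpoints (the thresholds, the image path, and the per-cell OCR step are placeholders; this is an illustration, not a tuned pipeline):

```python
# Rough sketch: detect tables first, then run structure recognition on each
# table crop; individual rows/columns can then be cropped and sent to OCR
# separately. Thresholds and the image path are placeholders.
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

image = Image.open("scanned_form.png").convert("RGB")

# Step 1: find tables on the page.
det_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
det_model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

inputs = det_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = det_model(**inputs)
sizes = torch.tensor([image.size[::-1]])  # (height, width)
tables = det_processor.post_process_object_detection(outputs, threshold=0.7, target_sizes=sizes)[0]

# Step 2: recognize rows/columns inside each detected table.
str_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-structure-recognition")
str_model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-structure-recognition")

for box in tables["boxes"]:
    crop = image.crop(box.tolist())
    s_inputs = str_processor(images=crop, return_tensors="pt")
    with torch.no_grad():
        s_outputs = str_model(**s_inputs)
    s_sizes = torch.tensor([crop.size[::-1]])
    cells = str_processor.post_process_object_detection(s_outputs, threshold=0.6, target_sizes=s_sizes)[0]
    # Each detected row/column box can now be cropped and passed to OCR on its
    # own, so a label like "FIRST NAME" stays paired with the value next to it.
    for label_id, cell_box in zip(cells["labels"], cells["boxes"]):
        print(str_model.config.id2label[label_id.item()], cell_box.tolist())
```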