Hello to the whole community, and thanks for everything you bring to AI today,
I am searching for a model for a specific use-case.
I need to extract each requirement from what we call a “Requirements matrix”, which is a PDF file. These requirements need to be extracted into tabular form.
My idea is to use “Feature extraction” models. Is that the right approach?
Do you know of a model, or a kind of model, that would fit this use case?
How about LayoutLM? The following explanation is a little out of date, and V3 has already been released.
LayoutLM
To address your requirement of extracting data from a “Requirements matrix” PDF into a tabular form, using models specifically designed for table extraction would be more effective than general “Feature extraction” models. Based on the sources provided, here are the relevant models and approaches:
Table Transformer Models: Microsoft’s Table Transformer models are specifically designed for table extraction from documents. These models can detect and extract tables from PDFs, making them suitable for extracting requirements into a structured format [1][3].
LayoutLM: LayoutLM is a document-understanding model that combines the text of a page with its layout (the position of each word). It is typically fine-tuned for tasks such as form and field extraction, so it can be used to pull structured data like requirements out of PDFs, including content laid out in tables [2].
Integration with OCR: For PDFs that require Optical Character Recognition (OCR) to extract text, models that integrate OCR capabilities can be used alongside the table extraction models to ensure accurate data retrieval [5].
Recommendation: Use Microsoft’s Table Transformer models or LayoutLM to extract the “Requirements matrix” from your PDF. Both are built for structured document content and can be combined with OCR tools for improved accuracy.
For implementation, refer to the code snippets provided in Source [3], which demonstrate how to use the TableTransformerModel for table detection and extraction.
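For what it's worth, the detection step could be sketched roughly as below. This is only a minimal sketch, not the code from Source [3]: it assumes the PDF page has already been rendered to an RGB PIL image (for example with a PDF-to-image tool), that the `microsoft/table-transformer-detection` checkpoint is the one you want, and that `torch` and `transformers` are installed. The helper names `pad_box` and `detect_tables` are made up for illustration.

```python
def pad_box(box, image_size, margin=10):
    """Expand an (x0, y0, x1, y1) box by `margin` pixels, clamped to the image."""
    x0, y0, x1, y1 = box
    width, height = image_size
    return (
        max(0, int(x0) - margin),
        max(0, int(y0) - margin),
        min(width, int(x1) + margin),
        min(height, int(y1) + margin),
    )


def detect_tables(image, threshold=0.9):
    """Return padded table boxes found on a PIL image of one PDF page."""
    # Imported lazily so pad_box stays usable without torch installed.
    import torch
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    # Assumed checkpoint; swap in whichever Table Transformer weights you use.
    checkpoint = "microsoft/table-transformer-detection"
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = TableTransformerForObjectDetection.from_pretrained(checkpoint)

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # PIL reports (width, height); the post-processor expects (height, width).
    target_sizes = torch.tensor([image.size[::-1]])
    result = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return [pad_box(box.tolist(), image.size) for box in result["boxes"]]
```

Each returned box can then be passed to `image.crop(...)` and handed to OCR or to a structure-recognition model to recover rows and columns. The padding is there because detected boxes tend to clip the outer cell borders; the 10-pixel margin is an arbitrary starting value.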
I see. In that case, I think this post will probably give you a clue. Unless your question is about processing the PDF afterwards, quite a few people are already trying to extract or summarize PDF content.
Thanks a lot for helping me.
From what I can see, there is no ready-to-use model for my case, so I am going to train a model to fit my use case. What kind of model should I prefer for this?
What comes to mind first is BERT, but isn't it a bit outdated?
Then, regarding the datasets, can you recommend any blog posts, videos, etc. that I could follow to build the dataset I will need for training? I do not see any existing dataset on HF that contains data such as “requirements documents”.
BERT is now an ancestor. There is also ModernBERT. I think it would be a good idea to find a model that suits you by referring to the benchmarks listed on the leaderboards.
There is no set method for creating datasets; if anything, exploring how to build them is a fairly important part of generative AI research. That said, I think the Hugging Face courses, blog, and Cookbook are useful.
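As one possible starting point (not a set method), if you end up fine-tuning a BERT-style model for token classification, the hand-labeled rows could be stored as BIO-tagged examples in a JSON Lines file. The sketch below assumes that framing; the field names `tokens` and `ner_tags`, the labels `ID` and `TEXT`, and the sample requirement are all made up for illustration.

```python
import json


def to_bio_example(tokens, spans):
    """Tag tokens with BIO labels.

    spans holds (start, end, label) token ranges, with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return {"tokens": tokens, "ner_tags": tags}


def to_jsonl(examples):
    """Serialize examples as JSON Lines, one training record per line."""
    return "\n".join(json.dumps(example) for example in examples)


# Made-up requirement row, only for illustration.
example = to_bio_example(
    ["REQ-001", "The", "system", "shall", "log", "errors"],
    [(0, 1, "ID"), (1, 6, "TEXT")],
)
print(to_jsonl([example]))
```

A file written this way can be loaded with `load_dataset("json", data_files=...)` from the `datasets` library, which keeps the labeling format independent of whichever model you pick from the leaderboards.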