I need a model for requirements extraction

fallais · March 26, 2025, 9:40am

Hello to the whole community, and thanks for what you bring to AI today,

I am searching for a model for a specific use-case.

I need to extract each requirement from what we call a “Requirements matrix”, which is a PDF file. These requirements need to be extracted into tabular form.

My idea is to use “Feature extraction” models. Am I right?

Do you know a model, or a kind of model, that would fit this use case?

Thanks a lot !

John6666 · March 26, 2025, 11:15am

How about LayoutLM? The following explanation is a little out of date, and V3 has already been released.

LayoutLM

To address your requirement of extracting data from a “Requirements matrix” PDF into a tabular form, using models specifically designed for table extraction would be more effective than general “Feature extraction” models. Based on the sources provided, here are the relevant models and approaches:

Table Transformer Models: Microsoft’s Table Transformer models are specifically designed for table extraction from documents. These models can detect and extract tables from PDFs, making them suitable for extracting requirements into a structured format [1][3].
LayoutLM: LayoutLM is a document-understanding model that combines the text of a page with its layout (position) information. It can handle the layout of PDF documents, including tables, and is recommended for extracting structured data like requirements from PDFs [2].
Integration with OCR: For PDFs that require Optical Character Recognition (OCR) to extract text, models that integrate OCR capabilities can be used alongside the table extraction models to ensure accurate data retrieval [5].

Recommendation: Use Microsoft’s Table Transformer models or LayoutLM for extracting the “Requirements matrix” from your PDF. These models are tailored for table extraction and can be integrated with OCR tools for improved accuracy.

For implementation, refer to the code snippets provided in Source [3], which demonstrate how to use the TableTransformerModel for table detection and extraction.
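Since the snippets from Source [3] are not reproduced in this thread, here is a minimal sketch of table detection with Table Transformer. It assumes each PDF page has already been rendered to a PIL image, that `transformers`, `torch` and `Pillow` are installed, and that the `microsoft/table-transformer-detection` checkpoint is the one you want. `detect_tables` and `pad_box` are illustrative names, not library functions, and the threshold and padding values are guesses to tune.

```python
def pad_box(box, padding, width, height):
    """Grow an (x0, y0, x1, y1) box by `padding` pixels, clamped to the page."""
    x0, y0, x1, y1 = box
    return (
        max(0, int(x0) - padding),
        max(0, int(y0) - padding),
        min(width, int(x1) + padding),
        min(height, int(y1) + padding),
    )


def detect_tables(image, threshold=0.9, padding=10):
    """Return cropped table images found on one rendered page (a PIL image)."""
    # Imported lazily so pad_box can be used without the heavy dependencies.
    import torch
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    checkpoint = "microsoft/table-transformer-detection"  # assumed checkpoint
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = TableTransformerForObjectDetection.from_pretrained(checkpoint)

    image = image.convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # target_sizes expects (height, width); PIL's image.size is (width, height).
    target_sizes = torch.tensor([image.size[::-1]])
    result = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]

    # Crop each detected table with a small margin around the predicted box.
    crops = []
    for box in result["boxes"].tolist():
        crops.append(image.crop(pad_box(box, padding, *image.size)))
    return crops
```

Note that this only locates tables. Getting the cell text still needs a structure-recognition step (for example the `microsoft/table-transformer-structure-recognition` checkpoint) and OCR or the PDF's own text layer.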

fallais · March 26, 2025, 2:08pm

Thanks a lot for highlighting this model to me.

The PDF documents will not only contain tables; the requirements can also appear in the document as paragraphs or plain sentences.

What I need is to get all the requirements; then I will display them in a table.
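To make the target concrete, here is a rough baseline sketch, not a model: it assumes the PDF text has already been extracted to a plain string and that requirements are phrased with modal keywords such as "shall", "must" or "should". That keyword list is my assumption and may not match your documents. The function names and the `REQ-001` ID format are illustrative.

```python
import csv
import io
import re

# Assumed modal keywords; real requirement documents may use different wording.
REQUIREMENT_PATTERN = re.compile(r"\b(shall|must|should)\b", re.IGNORECASE)
# Naive sentence splitter: break on whitespace that follows ., ! or ?
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def extract_requirements(text):
    """Return one row per sentence that contains a requirement keyword."""
    rows = []
    for sentence in SENTENCE_SPLIT.split(text):
        sentence = " ".join(sentence.split())  # collapse line breaks and extra spaces
        match = REQUIREMENT_PATTERN.search(sentence)
        if match:
            rows.append({
                "id": f"REQ-{len(rows) + 1:03d}",
                "keyword": match.group(1).lower(),
                "text": sentence,
            })
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "keyword", "text"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample = (
    "This document describes the system. The system shall log every login attempt. "
    "Backups are nice to have. The operator must be able\nto export reports."
)
rows = extract_requirements(sample)
print(to_csv(rows))
```

A baseline like this is also a cheap way to pre-label sentences before correcting them by hand, if a trained classifier turns out to be needed.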

John6666 · March 26, 2025, 2:34pm

I see. In that case, I think this post will probably be a useful clue. Apart from post-processing the PDF, quite a few people are trying to extract or summarize its content.

fallais · March 31, 2025, 1:14pm

Thanks a lot for helping me.
From what I can see, there is no ready-to-use model for my case, so I am going to train a model to fit my use case. What kind of model should I prefer for this?

What comes to mind first is BERT, but is it not a bit outdated?

Then, regarding datasets, do you recommend any blog posts, videos, etc. that I could follow to build the dataset I will need to train it? I do not see any existing dataset on HF that contains data such as “requirements documents”.

Thanks a lot !

John6666 · March 31, 2025, 3:43pm

BERT is now an ancestor. There is also ModernBERT. I think it would be a good idea to find a model that suits you by referring to the benchmarks listed on the leaderboard.

There is no set method for creating datasets. Or rather, exploring that is a fairly important part of generative AI research. However, I think the courses, the Hugging Face blog and the Cookbook are useful.
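As one possible starting point, here is a sketch of a dataset layout, assuming the task is framed as sentence classification (requirement vs. other text), which is only one way to frame it. The example sentences, the 0/1 label scheme and the function names are all made up for illustration.

```python
import json
import random


def build_splits(examples, test_fraction=0.2, seed=0):
    """Shuffle labeled sentences and split them into (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# Hypothetical hand-labeled sentences: label 1 = requirement, 0 = other text.
examples = [
    {"text": "The system shall encrypt data at rest.", "label": 1},
    {"text": "This chapter gives an overview of the project.", "label": 0},
    {"text": "The operator must be able to export reports.", "label": 1},
    {"text": "See the appendix for the glossary.", "label": 0},
    {"text": "Response time should stay below two seconds.", "label": 1},
]

train, test = build_splits(examples)
print(to_jsonl(train))
```

If the two strings are saved as `train.jsonl` and `test.jsonl`, they can be loaded with `datasets.load_dataset("json", data_files={"train": "train.jsonl", "test": "test.jsonl"})` and used to fine-tune an encoder such as ModernBERT.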

