Named Entity Recognition for PDFs

JetsonEarth · July 22, 2022, 5:04pm

Hi fellow NLP enthusiasts!

I am working on an NER project that extracts information from unstructured data such as PDFs and images and writes it to a CSV. So far, I have used Amazon Comprehend to build an NER pipeline. It achieves an F1 score of 0.89, which is good given that I have only 250+ documents for training, but I want to take it further. There are a few issues with Comprehend that I am not happy with:

First, I cannot customize my model architecture. All I need to do to build an NER model on Comprehend is prepare the training data and feed it into the AWS console for training. I don’t have the flexibility to choose my model architecture, the way I can pick any transformer in spaCy. To me, Comprehend is a black-box solution.
Second, I cannot choose how the dataset is split. I am not 100% sure, but I think Comprehend uses a random split by default. I want more control over how my data is split.
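
Outside Comprehend, the split can be done by hand before training. A minimal sketch, assuming each document carries a hypothetical `group` key (for example, the vendor or template it came from) so that related documents never land on both sides of the split. The field names and the 20% dev fraction are illustrative assumptions, not something Comprehend or spaCy prescribes:

```python
import hashlib


def assign_split(group_id: str, dev_fraction: float = 0.2) -> str:
    """Deterministically map a group ID to 'train' or 'dev'."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "dev" if bucket < dev_fraction else "train"


def split_documents(docs, dev_fraction=0.2):
    """Split documents so that every group stays entirely in one split."""
    splits = {"train": [], "dev": []}
    for doc in docs:
        # "group" is a hypothetical field; use whatever identifies related documents.
        splits[assign_split(doc["group"], dev_fraction)].append(doc)
    return splits


# Illustrative data only: 50 documents from 5 made-up vendors.
docs = [{"id": i, "group": f"vendor-{i % 5}"} for i in range(50)]
splits = split_documents(docs)
print(len(splits["train"]), len(splits["dev"]))
```

Because the assignment depends only on the group ID, rerunning the split after adding new documents leaves existing documents where they were. With few groups, the realized dev fraction can differ noticeably from the requested one.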

I discovered spaCy and Hugging Face Transformers this past week, and am in awe of how powerful these open-source tools are. I would love to migrate my workflow to spaCy and compare its performance with the model I built with Comprehend.

Here are my questions:

I cannot seem to find a good data annotation tool for PDFs. Almost all of the tools out there accept text but not PDFs, so I would have to run OCR and convert my PDFs to text first. Are there any tools that would let me annotate data directly on PDFs? I would love to use Prodi.gy but it’s out of my budget.
Would converting PDFs to plain text (not using OCR) cause a loss of the document’s spatial information? If so, would that impair the performance of an NER model?
I have already annotated my PDFs with Amazon SageMaker Ground Truth; however, I am not sure whether I can use the annotated files directly in other NLP tools like spaCy. For anyone who is familiar with SageMaker, do you know if this is doable?
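
On the last question, annotations that boil down to character offsets can usually be reshaped for other tools. A minimal sketch, assuming each record is a JSON line with a `source` text and a list of entities carrying `startOffset`, `endOffset` (treated here as exclusive), and `label`. The attribute name `ner-job` is hypothetical, and PDF-based labeling jobs may store annotations in a different, block-based layout that this sketch does not handle. The output is the `(text, {"entities": [(start, end, label)]})` shape commonly used when preparing spaCy training data:

```python
import json


def manifest_line_to_spacy(line: str, label_attr: str = "ner-job"):
    """Turn one offset-annotated JSON line into a (text, annotations) tuple."""
    record = json.loads(line)
    text = record["source"]
    entities = record[label_attr]["annotations"]["entities"]
    spans = sorted(
        (e["startOffset"], e["endOffset"], e["label"]) for e in entities
    )
    # Keep only spans that lie inside the text and do not overlap an earlier one.
    kept, last_end = [], 0
    for start, end, label in spans:
        if 0 <= start < end <= len(text) and start >= last_end:
            kept.append((start, end, label))
            last_end = end
    return text, {"entities": kept}


# Made-up record for illustration; "ner-job" is an assumed attribute name.
line = json.dumps({
    "source": "Invoice from Acme Corp dated 2022-07-22",
    "ner-job": {"annotations": {"entities": [
        {"startOffset": 13, "endOffset": 22, "label": "ORG"},
        {"startOffset": 29, "endOffset": 39, "label": "DATE"},
    ]}},
})
text, annotations = manifest_line_to_spacy(line)
print(annotations)  # {'entities': [(13, 22, 'ORG'), (29, 39, 'DATE')]}
```

It is worth spot-checking a few converted records by slicing the text with each span, since an off-by-one in the offset convention silently shifts every label.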

Thank you!

patrickthebruce · October 26, 2022, 1:29pm

I am in a similar situation. What did you end up doing?

DoubleCortado · February 26, 2023, 2:28pm

Following up

Ihor · December 4, 2023, 8:39am

I can’t answer all of your questions, but converting PDFs to plain text will definitely lose spatial information. If layout is not important for your task, I would recommend working with plain text, which gives you much more flexibility because more models are available.
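
A small illustration of what gets lost, using made-up word positions of the kind an OCR or PDF parser might report (the coordinates, the 200-unit column width, and the function names are all assumptions for the sketch). Flattening a two-column form in reading order separates each field name from its value, while the x positions are enough to keep them together:

```python
from collections import defaultdict

# Hypothetical parser output: (word, x, y) giving each word's top-left position.
WORDS = [
    ("Name", 10, 10), ("Date", 300, 10),
    ("Alice", 10, 30), ("2022-07-22", 300, 30),
]


def flatten_row_major(words):
    """Plain-text reading order: top to bottom, then left to right."""
    ordered = sorted(words, key=lambda w: (w[2], w[1]))
    return " ".join(w[0] for w in ordered)


def group_by_column(words, column_width=200):
    """Use x positions to keep the words of each column together."""
    columns = defaultdict(list)
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        columns[x // column_width].append(text)
    return [" ".join(columns[c]) for c in sorted(columns)]


print(flatten_row_major(WORDS))  # Name Date Alice 2022-07-22
print(group_by_column(WORDS))    # ['Name Alice', 'Date 2022-07-22']
```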

For anyone struggling to extract named entities from PDFs, I would like to recommend the article that demonstrates how to extract arbitrary named entities by adding zero-shot model capabilities to a spaCy pipeline.

nielsr · December 4, 2023, 9:30am

Hi,

It’s recommended to leverage Document AI models, which take layout information as input in addition to the text. See our blog post for an overview of all available models: Accelerating Document AI.

In particular, models like LayoutLM and Donut should be suitable for you.
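
LayoutLM-style models take one bounding box per token alongside the text, with coordinates scaled to a 0 to 1000 range relative to the page size. A minimal sketch of that normalization, assuming pixel boxes in `(x0, y0, x1, y1)` order; the function name, page size, and sample boxes are illustrative:

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Made-up OCR output for one 850 x 1100 pixel page.
words = ["Invoice", "2022-07-22"]
boxes = [(85, 110, 255, 132), (510, 110, 680, 132)]
normalized = [normalize_box(b, page_width=850, page_height=1100) for b in boxes]
print(normalized)  # [[100, 100, 300, 120], [600, 100, 800, 120]]
```

The words and their normalized boxes are then passed to the model together. Donut, by contrast, works from the page image directly and does not need a separate OCR step.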
