Named Entity Recognition for PDFs

Hi fellow NLP enthusiasts! :smiley:

I am working on an NER project that extracts information from unstructured data such as PDFs and images and writes it to a CSV file. So far, I have used Amazon Comprehend to build an NER pipeline, and I have achieved an F1 score of 0.89, which is good given that I only have 250+ documents for training, but I want to take it further. There are a few issues with Comprehend that I am not happy with:

  • First, I cannot customize my model architecture. To build an NER model on Comprehend, all I do is prepare the training data and feed it into the AWS console for training. I don’t have the flexibility to choose my model architecture, the way I can choose any transformer I like in spaCy. To me, Comprehend is essentially a black-box solution.
  • Second, I cannot choose the method for splitting the datasets. I am not 100% sure, but I think Comprehend uses a random split by default. I want more control over how my data is split.
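
To illustrate the kind of control meant here, below is a minimal sketch of a deterministic split done outside Comprehend, before training. The function names, the `dev_fraction` parameter, and the idea of hashing a document ID are all assumptions for illustration, not anything Comprehend or spaCy provides.

```python
import hashlib


def assign_split(doc_id: str, dev_fraction: float = 0.2) -> str:
    """Deterministically assign a document to 'train' or 'dev' from its ID."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "dev" if bucket < dev_fraction else "train"


def split_documents(doc_ids, dev_fraction: float = 0.2):
    """Group document IDs (hypothetical, e.g. file names) by split."""
    splits = {"train": [], "dev": []}
    for doc_id in doc_ids:
        splits[assign_split(doc_id, dev_fraction)].append(doc_id)
    return splits
```

Because the assignment depends only on the document ID, a document stays in the same split when new documents are added, and all pages of one PDF stay together if they share an ID.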

I discovered spaCy and Hugging Face transformers this past week, and am in awe of how powerful these open source tools are. I would love to migrate my workflow to spaCy and compare its performance with the model built leveraging Comprehend.

Here are my questions:

  • I cannot seem to find a good data annotation tool for PDFs… Almost all of the tools out there accept text but not PDFs, so I would have to run OCR and convert my PDFs to text first. Are there any tools that would let me annotate data directly on PDFs? I would love to use Prodi.gy, but it’s out of my budget :frowning:
  • Would converting PDFs to plain text (not using OCR) cause a loss in the spatial information of the document? If that’s the case, would that impair the performance of an NER model?
  • I have already annotated my PDFs with Amazon SageMaker Ground Truth. However, I am not sure whether I can use the annotated files directly with other NLP tools like spaCy. For anyone who is familiar with SageMaker, do you know if this is doable?
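
A minimal sketch of what such a conversion might look like, assuming the annotations are in a text-style Ground Truth output manifest: one JSON object per line, with the text under `source` and entity spans under `<label attribute>` → `annotations` → `entities` as `startOffset` / `endOffset` / `label`. The label attribute name `ner-labels` is hypothetical, and the layout-aware manifest produced for PDF annotation jobs is structured differently, so this would not apply to it unchanged.

```python
import json


def manifest_to_spacy(lines, label_attribute="ner-labels"):
    """Convert manifest lines to (text, {"entities": [(start, end, label)]}) tuples.

    Assumes endOffset is exclusive, like a Python slice; check this on a few
    real records before relying on it.
    """
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the manifest
        record = json.loads(line)
        text = record["source"]
        entities = record[label_attribute]["annotations"]["entities"]
        spans = sorted(
            (e["startOffset"], e["endOffset"], e["label"]) for e in entities
        )
        examples.append((text, {"entities": spans}))
    return examples
```

In spaCy v3, each tuple could then be turned into a `Doc` with `doc.char_span(start, end, label=label)` and saved in a `DocBin` for training.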

Thank you!

I am in a similar situation. What did you end up doing?

Following up

I can’t answer all of your questions, but converting PDFs to plain text will certainly cause a loss of spatial information. If layout isn’t important for your entities, I would recommend just working with plain text, which brings much more flexibility because more models are available.

For anyone who struggles with extracting named entities from PDFs, I would like to recommend an article that demonstrates how to extract arbitrary named entities by adding zero-shot model capabilities to a spaCy pipeline.

Hi,

It’s recommended to leverage document AI models, which also take layout information as input besides the text. See our blog post for an overview of all models available: Accelerating Document AI.

In particular, models like LayoutLM and Donut should be suitable for you.
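
To make "layout information as input" concrete: LayoutLM-style models take a bounding box per word, with coordinates scaled to a 0-1000 range relative to the page size. Below is a minimal sketch of that normalization. The `ocr_words` data and the page size are made up for illustration, and Donut works directly from the page image, so this step would not apply to it.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale an (x0, y0, x1, y1) box in page units to the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical OCR output: (word, box) pairs on a 612 x 792 page.
ocr_words = [
    ("Invoice", (0, 0, 306, 396)),
    ("Total", (306, 396, 612, 792)),
]

words = [word for word, _ in ocr_words]
boxes = [normalize_bbox(box, 612, 792) for _, box in ocr_words]
```

The `words` and `boxes` lists are the two parallel inputs such a model would need alongside the usual token labels.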