Named Entity Recognition for PDFs

Hi fellow NLP enthusiasts! :smiley:

I am working on an NER project that extracts information from unstructured data such as PDFs and images and writes it to a CSV. So far, I have used Amazon Comprehend to build a working NER pipeline. I have achieved an F1 score of 0.89, which is good given that I only have 250+ documents for training, but I want to take it further. There are a few issues with Comprehend that I am not happy with:

  • First, I cannot customize my model architecture. All I need to do to build an NER model on Comprehend is prepare the training data and feed it into the AWS console for training. I don’t have the flexibility to choose my model architecture, the way I can pick any transformer I like in spaCy. To me, Comprehend is essentially a black box.
  • Second, I cannot choose the method for splitting the datasets. I am not 100% sure, but I think Comprehend uses a random split by default. I want more control over how my data is split.
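
To make the second point concrete, here is a minimal sketch of the kind of split control I have in mind: a deterministic split keyed on a document ID, so the same document always lands in the same set no matter how often I rerun it. The function names, the 20% dev fraction, and the file-name IDs are my own assumptions for illustration; nothing here comes from Comprehend or spaCy.

```python
import hashlib


def assign_split(doc_id: str, dev_fraction: float = 0.2) -> str:
    """Deterministically map a document ID to 'train' or 'dev'."""
    # Hash the ID so the assignment does not depend on input order or a random seed.
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "dev" if bucket < dev_fraction else "train"


def split_documents(doc_ids, dev_fraction: float = 0.2):
    """Group document IDs into train and dev lists."""
    splits = {"train": [], "dev": []}
    for doc_id in doc_ids:
        splits[assign_split(doc_id, dev_fraction)].append(doc_id)
    return splits


if __name__ == "__main__":
    # Hypothetical document IDs (e.g. PDF file names).
    ids = [f"invoice_{i:03d}.pdf" for i in range(10)]
    print(split_documents(ids))
```

Because the assignment depends only on the ID, pages or crops that share a document ID cannot leak across the train/dev boundary, which a purely random split over pages would not guarantee.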

I discovered spaCy and Hugging Face transformers this past week, and I am in awe of how powerful these open-source tools are. I would love to migrate my workflow to spaCy and compare its performance with the model I built with Comprehend.

Here are my questions:

  • I cannot seem to find a good data annotation tool for PDFs… Almost all of the tools out there accept plain text but not PDFs, so I would have to run OCR and convert my PDFs to text first. Are there any tools that would let me annotate directly on PDFs? I would love to use Prodi.gy, but it’s out of my budget :frowning:
  • Would converting PDFs to plain text (without OCR) lose the spatial information of the document? If so, would that hurt the performance of an NER model?
  • I have already annotated my PDFs with Amazon SageMaker Ground Truth. However, I am not sure whether I can use the annotated files directly in other NLP tools like spaCy. For anyone who is familiar with SageMaker, do you know if this is doable?
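
For the last question, this is roughly the conversion I imagine needing, as a sketch only. I am assuming the annotations can be exported as JSON Lines where each record has a `text` field and an `entities` list with `start`, `end`, and `label` keys. Those field names are hypothetical and are not the actual Ground Truth output format, which I have not verified. The output is a list of `(text, {"entities": [(start, end, label)]})` tuples, i.e. plain character-offset spans.

```python
import json


def to_offset_examples(jsonl_lines):
    """Convert JSON Lines annotation records into (text, {"entities": [...]}) tuples.

    Assumed (hypothetical) record shape:
    {"text": "...", "entities": [{"start": 0, "end": 4, "label": "ORG"}]}
    """
    examples = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record["text"]
        spans = []
        for ent in record.get("entities", []):
            start, end, label = ent["start"], ent["end"], ent["label"]
            # Keep only spans that fall inside the text; drop malformed ones.
            if 0 <= start < end <= len(text):
                spans.append((start, end, label))
        examples.append((text, {"entities": sorted(spans)}))
    return examples


if __name__ == "__main__":
    sample = [
        '{"text": "Acme Corp paid 500 USD", '
        '"entities": [{"start": 0, "end": 9, "label": "ORG"}, '
        '{"start": 15, "end": 22, "label": "MONEY"}]}'
    ]
    for text, annotations in to_offset_examples(sample):
        print(text, annotations)
```

From tuples like these, my understanding is that each span could be turned into a spaCy entity with `doc.char_span(start, end, label=label)`, but whether the offsets in my annotated files line up with spaCy's tokenization is exactly what I am unsure about.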

Thank you!