Hi fellow NLP enthusiasts!
I am working on an NER project that extracts information from unstructured data such as PDFs and images and writes it to a CSV. So far, I have used Amazon Comprehend to build a working NER pipeline. It reaches an F1 score of 0.89, which is good given that I only have 250+ documents for training, but I want to take it further. There are a few issues with Comprehend that I am not happy with:
- First, I cannot customize my model architecture. All I need to do to build an NER model on Comprehend is prepare the training data and feed it into the AWS console for training. I don’t have the flexibility to choose my model architecture, the way I can pick any transformer I like in spaCy. To me, Comprehend is essentially a black-box solution.
- Second, I cannot choose the method for splitting the datasets. I am not 100% sure, but I think Comprehend uses a random split by default. I want more control over how my data is split.
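To make the second point concrete, this is roughly the kind of control I am after: keeping every page of a source document in the same split, and getting the same split on every run. It is only a minimal sketch using the Python standard library. The function names and the `(doc_id, text)` layout are my own placeholders, not anything from Comprehend or spaCy.

```python
import hashlib


def assign_split(doc_id: str, dev_fraction: float = 0.2) -> str:
    """Deterministically map a document ID to 'train' or 'dev'."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    # Turn the first 32 bits of the hash into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "dev" if bucket < dev_fraction else "train"


def split_pages(pages, dev_fraction: float = 0.2):
    """Split (doc_id, text) pairs so pages of one document never straddle splits."""
    splits = {"train": [], "dev": []}
    for doc_id, text in pages:
        splits[assign_split(doc_id, dev_fraction)].append((doc_id, text))
    return splits


# Hypothetical example data: two pages from one PDF, one page from another.
pages = [
    ("invoice_001.pdf", "page 1 text"),
    ("invoice_001.pdf", "page 2 text"),
    ("invoice_002.pdf", "page 1 text"),
]
splits = split_pages(pages, dev_fraction=0.2)
```

Because the assignment depends only on the document ID, adding new documents later does not reshuffle the ones that are already assigned. The resulting dev set is only approximately `dev_fraction` of the data, which matters more with a small corpus like mine.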
I discovered spaCy and Hugging Face transformers this past week, and I am in awe of how powerful these open-source tools are. I would love to migrate my workflow to spaCy and compare its performance with the model I built on Comprehend.
Here are my questions:
- I cannot seem to find a good data annotation tool for PDFs. Almost all of the tools out there accept text but not PDFs, so I would have to run OCR and convert my PDFs to text first. Are there any tools that would let me annotate directly on PDFs? I would love to use Prodi.gy, but it's out of my budget.
- Would converting PDFs to plain text (without OCR) lose the spatial layout information of the document? If so, would that hurt the performance of an NER model?
- I have already annotated my PDFs with Amazon SageMaker Ground Truth. However, I am not sure whether I can use the annotated files directly in other NLP tools like spaCy. For anyone who is familiar with SageMaker, do you know if this is doable?
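For the last question, here is the sort of conversion I imagine needing if the annotations cannot be used as they are. This is a rough sketch under a big assumption: that each annotated document can be exported as plain text plus character-offset spans. The record layout (`text`, `annotations`, `start`, `end`, `label`) is made up for illustration and is not the actual Ground Truth output format. The target shape, `(text, {"entities": [(start, end, label), ...]})`, is the character-offset form I have seen in spaCy's training examples.

```python
def to_spacy_tuples(records):
    """Convert hypothetical offset records into (text, {"entities": [...]}) tuples.

    Spans that fall outside the text or overlap an earlier span are skipped,
    since entity spans within one document are not allowed to overlap.
    """
    examples = []
    for rec in records:
        text = rec["text"]
        entities = []
        last_end = 0
        for ann in sorted(rec["annotations"], key=lambda a: a["start"]):
            start, end, label = ann["start"], ann["end"], ann["label"]
            in_bounds = 0 <= start < end <= len(text)
            if not in_bounds or start < last_end:
                continue  # drop invalid or overlapping spans
            entities.append((start, end, label))
            last_end = end
        examples.append((text, {"entities": entities}))
    return examples


# Hypothetical record; the labels and text are invented for this example.
records = [
    {
        "text": "Invoice 1042 issued by Acme Corp",
        "annotations": [
            {"start": 8, "end": 12, "label": "INVOICE_NO"},
            {"start": 23, "end": 32, "label": "VENDOR"},
        ],
    }
]
examples = to_spacy_tuples(records)
```

If something like this works, I assume the remaining step would be building spaCy `Doc` objects from these tuples and saving them with `DocBin` for training. The part I cannot judge is whether the offsets from my PDF annotations line up with the extracted text in the first place.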
Thank you!