Hi everyone, I am excited to be starting my journey into NLP.
Right now, I am training a custom NER model that extracts entities from unstructured data (mainly PDFs), aggregates them into a data frame, and eventually outputs them as a CSV. I am using different types of documents, such as ride-share receipts, hotel bills, electric bills, etc.
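For context, the aggregation step I have in mind looks roughly like this. It is a minimal sketch: the dicts mimic the Type/Text/Score shape of Comprehend entity results, but the file names, entity types, and values are all made up.

```python
import pandas as pd

# Hypothetical extraction output: one dict per detected entity,
# loosely shaped like an Amazon Comprehend entity result.
entities = [
    {"File": "bill_001.pdf", "Type": "PROVIDER", "Text": "Acme Energy", "Score": 0.97},
    {"File": "bill_001.pdf", "Type": "KWH_USED", "Text": "412", "Score": 0.91},
    {"File": "receipt_004.pdf", "Type": "FARE", "Text": "$18.50", "Score": 0.88},
]

# Aggregate into a data frame and export as a CSV.
df = pd.DataFrame(entities)
df.to_csv("extracted_entities.csv", index=False)
```

One row per detected entity keeps the CSV simple; pivoting to one row per document is a `df.pivot_table(...)` away if that is the shape you need downstream.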
I have implemented an end-to-end NER pipeline on Amazon Comprehend and have trained the model 3 times with increasing dataset sizes. I started with only 100 PDF documents and got an F1 score of 0.72. I then grew the dataset to 230 documents, retrained, and got an F1 score of 0.89.
At this point, I was convinced that increasing the dataset size would improve my model's performance. I also know the dataset needs to be diverse: for instance, electric bills from different utility providers have different layouts and formats, so collecting bills from various providers should prevent overfitting to a single provider's format.
I then trained the model on almost 300 documents, after collecting 50 more (30 were bills from 3 new providers; the remaining 20 came from data augmentation, simply adding watermarks to documents already in the training set). This time, however, overall performance decreased, even though performance increased for some entity types and decreased for others.
That is my situation; here are my questions:
- Is it too early to draw any conclusions about model performance, given that the dataset is still relatively small? Why would adding new training data decrease my model's performance?
- I am dealing with data scarcity, so I am applying data augmentation to existing documents to grow the training set; however, I am simply adding watermarks to the documents without changing any values (e.g., the energy-consumption figures). Could that be a problem?
- Would building the model from scratch with Hugging Face Transformers or spaCy give me better results than using a cloud provider?
- I am starting to think the performance of my second trial may have been misleading, i.e., not actually that good. The reason is that Amazon auto-split my dataset without giving me the ability to perform a random split myself. I am still trying to figure out how to do that with PDF data on Amazon; does anyone know how?
- For anyone familiar with Amazon Comprehend: are the labeled documents reusable with other tools like spaCy? It would certainly be a pain to label everything again if I decide to build my model from scratch instead of using a cloud provider…
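On the label-reuse question, here is the kind of conversion I imagine would be needed. The sketch assumes Comprehend's plain-text annotations CSV format (File, Line, Begin Offset, End Offset, Type); annotations for PDF/semi-structured documents use a different JSON-manifest format, so the column names and shapes here should be verified against your actual export. `comprehend_to_spacy` and the sample rows are my own hypothetical names and data.

```python
from collections import defaultdict

def comprehend_to_spacy(annotation_rows, line_texts):
    """Group Comprehend-style annotation rows by (file, line) and emit
    spaCy-style training tuples: (text, {"entities": [(start, end, label)]}).

    annotation_rows: dicts with File, Line, Begin Offset, End Offset, Type.
    line_texts: {(file, line_number): raw text of that line}.
    """
    spans = defaultdict(list)
    for row in annotation_rows:
        key = (row["File"], int(row["Line"]))
        spans[key].append(
            (int(row["Begin Offset"]), int(row["End Offset"]), row["Type"])
        )
    return [(line_texts[key], {"entities": ents}) for key, ents in spans.items()]

# Made-up example: one annotated span in one line of text.
rows = [
    {"File": "bill_001.txt", "Line": "0", "Begin Offset": "11",
     "End Offset": "22", "Type": "PROVIDER"},
]
texts = {("bill_001.txt", 0): "Billed by: Acme Energy for March"}
examples = comprehend_to_spacy(rows, texts)
```

The character offsets carry over directly because both formats describe spans over the same raw text; the main caveat is making sure the text you pair with the offsets is byte-for-byte identical to what was annotated.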
To avoid any confusion: I refer to 1 electric bill as "1 document", but Amazon counts documents by page, so 1 electric bill containing 5 pages translates to "5 documents".
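On the auto-split question, the workaround I am considering is splitting the document list myself before upload, using a seeded shuffle so the partition is reproducible, and then pointing training and test at separate locations. (I believe Comprehend lets you supply your own test set alongside the training data, but that is worth verifying in the docs.) A minimal sketch with made-up file names:

```python
import random

def train_test_split_files(files, test_fraction=0.2, seed=42):
    """Deterministically shuffle file names and split them, so the same
    seed always produces the same train/test partition."""
    shuffled = sorted(files)              # normalize order before shuffling
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical document list; in practice these would be S3 keys.
docs = [f"bills/doc_{i:03d}.pdf" for i in range(10)]
train, test = train_test_split_files(docs)
```

Sorting before shuffling matters: it makes the split independent of whatever order the file system happens to list the documents in.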