I’m an intern at a company that provides AI-based solutions to pharma companies. I’m currently working on an email classification project; specifically, we’re trying to improve its accuracy and ultimately reduce the false negative rate (FNR). In the current deployment, we take an email conversation thread, pass it as a prompt to a proprietary LLM, and it returns a class as output. For the last 1.5 months, we’ve been trying to improve the system prompt to raise accuracy and lower the FNR. We’ve managed to get accuracy up to 70% and the FNR down to ~2% (1.74%). I have the following questions:
1. Is it possible for a general-purpose LLM to reach >90% accuracy on a specialized dataset through prompting alone?
2. We have ~800 training samples. Could we perhaps fine-tune a text classification model from Hugging Face instead?
3. The email threads are very dirty. Some are 1500+ lines long, with lots of empty lines and long sign-offs (address, contact info, disclaimers, etc.). This could be a problem for LLMs in my view. Is there any way to clean these email threads?
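
For illustration, the kind of line-level cleanup I have in mind is something like the sketch below; the markers are guesses and would need tuning on our actual threads:

```python
import re

# Line-level noise filter; the specific markers are guesses and would need
# tuning on the real corpus (long legal footers may span multiple lines).
NOISE_RE = re.compile(
    r"^\s*(>+"                               # quoted-reply prefixes
    r"|confidentiality notice|disclaimer"    # legal boilerplate
    r"|(tel|phone|mobile|fax)\s*[:.]"        # contact-info lines
    r"|sent from my )",                      # mobile sign-offs
    re.IGNORECASE,
)

def clean_email_thread(text: str) -> str:
    """Drop empty lines and obvious boilerplate from an email thread."""
    kept = [
        line.rstrip()
        for line in text.splitlines()
        if line.strip() and not NOISE_RE.match(line)
    ]
    return "\n".join(kept)
```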
Any advice or resources would be greatly appreciated. Thanks!
As a baseline, I’d try a simple naive Bayes/SVM text classifier using sklearn. Then you could move on to fine-tuning a bigger model like BERT/DeBERTa/RoBERTa, based on this guide.
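
A minimal version of that baseline could look like the sketch below, where `texts` and `labels` are placeholders for your cleaned threads and their classes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# texts: list of (cleaned) email threads, labels: their classes.
for name, clf in [("naive_bayes", MultinomialNB()), ("linear_svm", LinearSVC())]:
    pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), clf)
    scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```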
If performance still isn’t sufficient, you could have a look at LLMs (either API-based or fine-tuned on a custom dataset using HF tooling).
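
For the BERT/DeBERTa route, a minimal fine-tuning sketch with the `Trainer` API could look like the following; the checkpoint name and hyperparameters are placeholders, and `texts`/`labels` again stand in for your ~800 labelled samples:

```python
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "microsoft/deberta-v3-base"  # any BERT-style checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)

# texts: list[str], labels: list[int] -- the labelled email threads.
dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)  # note: long threads get truncated here, so cleaning/trimming still matters
splits = dataset.train_test_split(test_size=0.2, seed=42)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(set(labels))
)

def compute_metrics(eval_pred):
    logits, gold = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == gold).mean())}

args = TrainingArguments(
    output_dir="email-classifier",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```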
@nielsr I did try a gradient boosting classifier as a baseline, using n-grams as features (uni-, bi- and tri-grams), and got ~90% accuracy on cross-validation. However, I’ll give naive Bayes and SVM a try as well.
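
For reference, that baseline looked roughly like this (a sketch, not the exact code; `texts`/`labels` are the threads and their classes):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Uni-, bi- and tri-gram counts feeding a gradient boosting classifier.
pipeline = make_pipeline(
    CountVectorizer(ngram_range=(1, 3), min_df=2),
    GradientBoostingClassifier(),
)
scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy")
print(scores.mean())
```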
The main reason I want to try fine-tuning a model like BERT is that we’re already using an API-based proprietary LLM (out of the box), and during inference it sometimes uses its own world knowledge (despite being asked not to in the prompt) to make predictions, which is the primary cause of misclassifications. For example, when trying to determine whether a medicine is being administered off-label, it checks for itself whether the medication is approved for the condition instead of looking for clues in the email.
The LLM is not able to pick up on the specific patterns, keywords and nuances in the dataset, which only come through training/fine-tuning.
The company for whom we’re developing this solution sends us a batch of emails every month. I wanted to know whether fine-tuning BERT (or one of its variants) through online learning would be a practical approach.
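
Concretely, what I’m imagining for each monthly batch is roughly the following (a sketch; the checkpoint path, hyperparameters and `new_texts`/`new_labels` are placeholders):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Continue training last month's checkpoint on the newly labelled batch.
checkpoint = "email-classifier/latest"            # placeholder path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# new_texts / new_labels: this month's labelled emails.
new_batch = Dataset.from_dict({"text": new_texts, "label": new_labels})
new_batch = new_batch.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=512),
    batched=True,
)

args = TrainingArguments(
    output_dir="email-classifier/latest",
    num_train_epochs=2,
    learning_rate=1e-5,   # lower LR to limit forgetting of earlier batches
    per_device_train_batch_size=16,
)
Trainer(model=model, args=args, train_dataset=new_batch, tokenizer=tokenizer).train()
# Evaluate on a fixed held-out set afterwards to catch drift/forgetting.
```

I assume mixing a sample of older emails into each monthly run (simple replay) would also be needed to avoid forgetting earlier patterns.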