Email classification, labeling and entity classification/extraction

I am building an application that needs to go through a user’s emails, classify them, and then extract some information. I implemented a pipeline with two models: a zero-shot text classification model classifies each email, and a zero-shot entity recognition model then extracts the information. This worked with varying degrees of success: classification was correct about 80% of the time, and entity recognition only 20%-30% of the time.
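
For reference, the two-stage flow described above can be sketched roughly as follows. This is only an illustration, not my actual code: `classify` and `extract` are hypothetical placeholders for the zero-shot classification and zero-shot entity recognition models, and the keyword/regex stubs exist only so the sketch runs on its own.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical signatures: the real models would sit behind these callables.
Classifier = Callable[[str], List[str]]      # email text -> predicted labels
Extractor = Callable[[str], Dict[str, str]]  # email text -> extracted entities


@dataclass
class EmailResult:
    labels: List[str]
    entities: Dict[str, str] = field(default_factory=dict)


def process_email(text: str, classify: Classifier, extract: Extractor) -> EmailResult:
    """Stage 1: classify the email. Stage 2: extract entities from it."""
    labels = classify(text)
    entities = extract(text)
    return EmailResult(labels=labels, entities=entities)


# Stand-in stubs (assumptions for illustration only, not real models).
def stub_classify(text: str) -> List[str]:
    return ["payment"] if "invoice" in text.lower() else ["others"]


def stub_extract(text: str) -> Dict[str, str]:
    match = re.search(r"\$\d+(?:\.\d{2})?", text)
    return {"amount": match.group(0)} if match else {}


result = process_email("Your invoice total is $12.99", stub_classify, stub_extract)
print(result)
```

Keeping the two stages behind plain callables makes it easy to swap in a fine-tuned model later without touching the rest of the pipeline.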

This showed me that the models need fine-tuning on a set of real-world emails, and the Enron email dataset isn’t going to cut it. That brings up the first challenge: manually labeling thousands of emails, which seems very time-consuming.

The other issue is processing time. I could have used a more advanced LLM, but I want to keep the processing time per email as low as possible, since there could be thousands to tens of thousands of emails per user.

So, TL;DR: First, how can I fine-tune a zero-shot text classification model and a zero-shot entity recognition model? Second, how can I acquire labelled data, or speed up labelling the data myself? Third, how can I reduce the processing time?

Any tips, tools, and guidance are greatly appreciated.


Which classes do you want to classify the emails into? Spam and not-spam (i.e. a binary classification problem), or are there more than two classes in the emails of the Enron dataset (a multiclass problem)? If so, can just one class be assigned to each email, or many classes/labels (a multilabel classification problem)?

The emails I am concerned with contain some sort of payment information about the services the user uses. So the classes would look something like [“subscription”, “one-off”, “payment”, “others”], and yes, an email can have more than one label (multi-label).
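
To make the multi-label setup concrete, here is a minimal sketch of how such labels are commonly encoded for training and decoded from per-class scores. The helper names `to_multi_hot` and `from_scores` and the 0.5 threshold are assumptions for illustration; only the class list comes from the description above.

```python
from typing import Iterable, List

CLASSES = ["subscription", "one-off", "payment", "others"]


def to_multi_hot(labels: Iterable[str], classes: List[str] = CLASSES) -> List[int]:
    """Encode a set of labels as a 0/1 vector, one slot per class."""
    label_set = set(labels)
    unknown = label_set - set(classes)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if c in label_set else 0 for c in classes]


def from_scores(
    scores: List[float],
    threshold: float = 0.5,  # assumed cut-off; would need tuning on real data
    classes: List[str] = CLASSES,
) -> List[str]:
    """Keep every class whose independent score reaches the threshold."""
    return [c for c, s in zip(classes, scores) if s >= threshold]


print(to_multi_hot(["subscription", "payment"]))  # [1, 0, 1, 0]
print(from_scores([0.9, 0.1, 0.7, 0.2]))          # ['subscription', 'payment']
```

Because each class is scored independently, one email can end up with several labels, or with none if every score falls below the threshold.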