Email classification, labeling and entity classification/extraction

bhavishya02 · June 5, 2024, 8:29am

I am trying to build an application that goes through a user’s emails, classifies them, and then extracts some information. I implemented a pipeline with two models: the first classifies the emails (a zero-shot text classification model), and once classification is done, the second extracts information (a zero-shot entity recognition model). This worked with varying degrees of success: classification was right about 80% of the time, and entity recognition only 20%-30% of the time.
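To make the two-stage structure concrete, here is a minimal sketch of the classify-then-extract flow. Everything named here is an assumption for illustration: `process_email`, `EmailResult`, the `skip_labels` parameter, and the two `dummy_*` functions are hypothetical stand-ins for the actual zero-shot models, which are not shown. Skipping extraction for emails that only receive a skipped label is one possible way to keep per-email time down, not something the pipeline necessarily does today.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List


@dataclass
class EmailResult:
    labels: List[str]
    entities: Dict[str, str] = field(default_factory=dict)


def process_email(
    text: str,
    classify: Callable[[str], List[str]],
    extract: Callable[[str], Dict[str, str]],
    skip_labels: FrozenSet[str] = frozenset(),
) -> EmailResult:
    """Stage 1: classify. Stage 2: extract, unless every label is skipped."""
    labels = classify(text)
    # If no label (or only skipped labels) was assigned, do not run extraction.
    if all(label in skip_labels for label in labels):
        return EmailResult(labels=labels)
    return EmailResult(labels=labels, entities=extract(text))


# Placeholder stand-ins for the real models (assumptions, not real classifiers).
def dummy_classify(text: str) -> List[str]:
    return ["relevant"] if "$" in text else ["irrelevant"]


def dummy_extract(text: str) -> Dict[str, str]:
    match = re.search(r"\$\d+(?:\.\d{2})?", text)
    return {"amount": match.group(0)} if match else {}


if __name__ == "__main__":
    result = process_email(
        "Your plan renews at $9.99 per month",
        dummy_classify,
        dummy_extract,
        skip_labels=frozenset({"irrelevant"}),
    )
    print(result)
```

Passing the models in as callables keeps the flow independent of whichever classifier or entity recognizer ends up being used.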

This showed me that the models need fine-tuning on a set of real-world emails, and the Enron email dataset isn’t going to cut it. That brings up the first challenge: manually labeling thousands of emails, which seems pretty time-consuming.

The other issue is processing time. I could have used a more advanced LLM, but I want to keep the processing time per email as low as possible, since there could be thousands to tens of thousands of emails per user to process.

So, TL;DR: First, how can I fine-tune a zero-shot text classification model and a zero-shot entity recognition model? Second, how can I acquire labelled data, or speed up the process of labeling it myself? Third, how can I speed up processing times?

Any tips, tools, and guidance are greatly appreciated.

juhoinkinen · June 5, 2024, 9:20pm

Which classes do you want to classify the emails into? Spam and not-spam (i.e. a binary classification problem), or are there more than two classes in the emails of the Enron dataset (a multiclass problem)? If so, can just one class be assigned to an email, or many classes/labels (a multilabel classification problem)?

bhavishya02 · June 6, 2024, 9:05am

The emails I am concerned with contain some sort of payment information regarding the services they use. So the classes would look something like [“subscription”, “one-off”, “payment”, “others”], and yes, an email can be multi-label.
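To illustrate what multi-label means for these four classes: each email gets a 0/1 vector with one slot per class, rather than a single class. A minimal sketch in plain Python, where `encode_labels`, `decode_scores`, and the 0.5 threshold are hypothetical choices for illustration and do not come from any library:

```python
from typing import Iterable, List, Sequence

LABELS = ["subscription", "one-off", "payment", "others"]


def encode_labels(assigned: Iterable[str]) -> List[int]:
    """Turn a set of label names into a multi-hot vector over LABELS."""
    assigned = set(assigned)
    unknown = assigned - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in assigned else 0 for label in LABELS]


def decode_scores(scores: Sequence[float], threshold: float = 0.5) -> List[str]:
    """Keep every label whose independent score reaches the threshold."""
    return [label for label, score in zip(LABELS, scores) if score >= threshold]


if __name__ == "__main__":
    # An email about a recurring charge could carry two labels at once.
    print(encode_labels(["subscription", "payment"]))
    # Per-label scores are thresholded independently, so several can pass.
    print(decode_scores([0.9, 0.1, 0.7, 0.2]))
```

Because each label is decided on its own, the scores do not need to sum to 1, which is the main practical difference from the multiclass case.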
