Text data labeling


Task: Classify customer emails into relevant categories based on their content.

Data: A DataFrame of customer emails. Each email pertains to a specific issue the customer encountered. My goal is to categorize these emails automatically. For example, an email about problems with food quality would be categorized as “Food Quality Issue,” while an email about payment difficulties would be categorized as “Payment Issue.”

The challenge is that the number of categories is not known in advance. Customers may contact us about a wide range of issues.

My plan is to address this challenge in two steps:

  1. Data Labeling: I need to label a representative sample of emails from the data frame. This labeling process involves assigning each email a category that accurately reflects the customer’s concern.
  2. Classifier Model Training: Once I have a labeled dataset, I can use it to train a machine learning model. This model will then be able to automatically categorize new, unseen emails based on the patterns it learns from the labeled data.

My question is about the first step: how can I label the data?

My approach so far was to generate embeddings of the emails and then apply clustering, but the results were not good. How can I solve this problem, and is there anything I should apply before generating the embeddings?
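
To make the last part of the question concrete, here is a minimal sketch of the kind of cleanup that could be applied to each email before generating embeddings. Everything in it is an assumption for illustration: the function name `clean_email` is hypothetical, and the rules (dropping quoted replies, URLs, email addresses, greetings, and sign-offs) are guesses at what might add noise, not something validated on this dataset.

```python
import re

# Hypothetical cleanup rules (assumptions); adjust them to the real email data.
URL_RE = re.compile(r"https?://\S+")
ADDRESS_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
# A greeting line at the very start of the email, e.g. "Hi team,".
GREETING_RE = re.compile(r"^(hi|hello|dear)\b[^\n]*\n", re.IGNORECASE)
# A sign-off at the start of a line, plus everything after it (signature block).
# Caveat: a body line that starts with "Thanks ..." would also be cut here.
SIGNOFF_RE = re.compile(
    r"\n(thanks|thank you|regards|best regards|sincerely)\b.*",
    re.IGNORECASE | re.DOTALL,
)


def clean_email(text: str) -> str:
    """Return a lowercased, single-line version of the email body."""
    # Drop quoted reply lines (lines starting with ">").
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith(">")]
    text = "\n".join(lines).strip()
    # Remove URLs and email addresses, which rarely describe the issue itself.
    text = URL_RE.sub(" ", text)
    text = ADDRESS_RE.sub(" ", text)
    # Remove the greeting line and the sign-off/signature block.
    text = GREETING_RE.sub("", text)
    text = SIGNOFF_RE.sub("", text)
    # Collapse whitespace and normalize case.
    return re.sub(r"\s+", " ", text).strip().lower()
```

With a DataFrame, this could be applied as `df["clean"] = df["email"].apply(clean_email)`, where the column name `email` is also an assumption, and the embeddings would then be computed on the `clean` column.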