Hi,
First of all, I’d like to thank the Hugging Face team for their awesome work; the platform documentation is great for beginners like me.
My question is more about method than about a specific technical issue. I work for a company where thousands of emails are exchanged with our customers every month. These emails relate to contracts, each identified by a unique reference. To keep track of every exchange for a given contract (which can be handled by different employees in the company), every email is “classified”, that is to say linked to a contract reference. This task is highly time-consuming and I’m considering using AI to do it automatically, but I don’t know where to start.
Although I think it’s a typical classification problem, my issue is that there are very many classes and only a few samples per class: there are thousands of contracts, but maybe 50 linked emails for each of them. So I don’t think it would make sense to train a model where the classes are the contract references. Moreover, new contracts pop up every day, so I would have to retrain my model constantly.
Another approach I thought about is to feed a model with the textual content of the email I want to classify, together with a short list of candidate contracts. Given the different email addresses in the mail, it is quite easy to retrieve a bunch of up to 30 “candidate contracts”; the whole challenge is to pick the one that matches the email. In this context, I would have a “30-class classification problem” where, for each email, I would gather the candidate contracts’ data to populate my 30 (max) classes. To summarize, I would have a training dataset with, for each email: the textual content (the email), up to 30 classes (the candidate contract references and all related data: title, description, date, provider emails…), and one contract reference (the one actually matching the email).
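To make the idea concrete, here is a minimal sketch of what I have in mind: score the email against each candidate contract and keep the best one. Everything in it is an assumption for illustration only: the contract references and descriptions are made up, `rank_candidates` is a hypothetical helper, and the word-overlap cosine score is just a placeholder for whatever learned model would actually compute the email/contract match.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(email_text, candidates):
    """Return (contract_ref, score) pairs sorted from best to worst match.

    `candidates` maps a contract reference to the text describing it
    (title, description, provider...). The scoring function is a stand-in:
    a trained model would replace `cosine` here.
    """
    email_vec = Counter(tokenize(email_text))
    scored = [
        (cosine(email_vec, Counter(tokenize(description))), ref)
        for ref, description in candidates.items()
    ]
    scored.sort(reverse=True)
    return [(ref, score) for score, ref in scored]


# Hypothetical candidate contracts retrieved from the email addresses.
candidates = {
    "CTR-001": "Office cleaning services, provider CleanCo, renewal 2023",
    "CTR-002": "Printer leasing and maintenance, provider PrintFast",
    "CTR-003": "Catering for annual meeting, provider FoodWorks",
}
email = "Hello, the leased printer on the 2nd floor needs maintenance again."

ranking = rank_candidates(email, candidates)
best_ref = ranking[0][0]
print(best_ref, ranking)
```

Framed this way, the model never has to know the full list of contracts: it only learns to score an (email, contract) pair, so new contracts would not require retraining as long as their description is available when the candidates are retrieved.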
Does it make any sense to proceed like this?