How to create "Other/garbage" class for classifier (e.g. COVID-19 classifier)

Hi there,

I’m currently training a classifier to sort news sentences into 20 classes of policy measures against COVID-19, such as “Lockdown”, “Curfew” etc. It will be part of a system that identifies new policy measure announcements in news articles. (An older development version of the model is here on the model hub)

The issue: In reality, 99% of sentences in news on covid have nothing to do with new policy announcements. So my task is (1) to identify the 1% of sentences which announce a new policy and (2) to classify these sentences. Step (2) works fine, but step (1) is quite difficult.

My current approach: train a classifier for 21 classes: 20 for the 20 policy types and 1 “Other/garbage” class. I’m creating the data for the “Other” class by extracting sentences from news which are semantically very different from the policy announcements. I get decent accuracy on the 20 policy types, but the big issue is the “Other” class, which either creates too many false positives or false negatives.

My question: What are best practices for eliminating “Other” sentences which are not relevant for the classification task? I feel like this must be a very common problem for real-world classification tasks (e.g. the same issue applies to sentiment classifiers, which are only trained to classify as positive/negative, while in reality 99% of sentences in e.g. news are neutral). But I couldn’t find literature on the issue.

Would be thankful for advice or hints for literature!


It sounds like any of the algorithms for extractive summarization, such as LexRank, would apply. You should be able to pick the most meaningful sentences of a news article that way and then classify only those, without a separate class for “other”.
Of course, there is no guarantee that you won’t exclude sentences that shouldn’t be in the “other” class.

Another thing I would try is to classify as “other” anything that doesn’t have a very high probability for any of the remaining classes. So if your model outputs a low probability even for the highest-probability class (i.e. it’s not confident in its prediction), just treat that as “other”.
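A minimal sketch of that thresholding idea, assuming the classifier exposes raw logits per sentence. The function name `predict_with_reject`, the label list, and the 0.9 cut-off are illustrative assumptions, not something from the model discussed here:

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_reject(logits, labels, threshold=0.9, other_label="Other"):
    """Return (label, confidence); fall back to `other_label` when the
    top probability is below `threshold` (an assumed, untuned value)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return other_label, probs[best]
    return labels[best], probs[best]

# Hypothetical labels and logits, for illustration only.
labels = ["Lockdown", "Curfew", "School closure"]
confident = predict_with_reject([6.0, 0.5, 0.2], labels)   # one class dominates
uncertain = predict_with_reject([1.1, 1.0, 0.9], labels)   # nearly flat scores
print(confident[0], uncertain[0])
```

The threshold would normally be chosen on a held-out set that contains realistic “other” sentences, since it directly trades false positives against false negatives.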

Thanks for your response @neuralpat.
Hmm… I feel like extractive summarization is not a good fit, because there is no reason why the summarizer would pick the sentences I’m looking for in the classification task.
Regarding the suggested classification thresholds: yeah, true, that could help a bit if the classifier is undecided between classes. The issue is that the classifier is forced to distribute probability only over the classes it has learned, and in my experience softmax will often give a high probability to some class simply because no other class fits.

I feel like there must be more elegant solutions for this / more literature, because it’s such a fundamental issue for real world applications where not every sentence/text input fits the classes you have.

Hi @MoritzLaurer, what about first using a binary classifier (or even regex :wink:) to pre-select the sentences that are relevant / not relevant, followed by applying your 20-class classifier to the relevant sentences?
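A rough sketch of such a two-stage setup, using a regex as the first stage. The cue words in `RELEVANCE_PATTERN` and the stub `classify_policy` are hypothetical placeholders; in practice the second stage would be the trained 20-class model and the first stage could be a trained binary classifier instead:

```python
import re

# Hypothetical cue words; a real filter would be tuned on labelled data.
RELEVANCE_PATTERN = re.compile(
    r"\b(announc\w*|impos\w*|introduc\w*|lockdown|curfew)\b", re.IGNORECASE
)

def is_relevant(sentence):
    """Stage 1: cheap binary pre-selection of candidate sentences."""
    return RELEVANCE_PATTERN.search(sentence) is not None

def classify_policy(sentence):
    """Stage 2 stand-in for the trained multi-class model (assumption)."""
    return "Curfew" if "curfew" in sentence.lower() else "Lockdown"

def two_stage_predict(sentences):
    """Only sentences that pass stage 1 reach the multi-class classifier."""
    results = []
    for s in sentences:
        label = classify_policy(s) if is_relevant(s) else "Other"
        results.append((s, label))
    return results

predictions = two_stage_predict([
    "The government announced a nightly curfew starting Monday.",
    "Hospital admissions rose slightly last week.",
])
for sentence, label in predictions:
    print(label, "|", sentence)
```

One benefit of splitting the stages is that each can be evaluated separately: recall matters most for the filter, while the multi-class model only ever sees sentences that resemble its training data.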


Yeah, that’s good advice @lewtun, I’m actually playing with this approach now.
The key challenge is to get good training data for the artificial “Other” class: few false positives, a good balance of semantic (dis)similarity to the main classes etc. If you have any hints for literature (or keywords to google), that would be very helpful. Sorting out irrelevant sentences/inputs seems to be a key issue for real-world applications of transformers, but I didn’t find that much guidance on it online. (Good topics seem to be “PU learning” or one-class classification.)

Ah, framing the problem as an “anomaly detection” task is a nice idea! (The idea being that the 1% of the sentences you care about are “anomalous” in some sense.)

At least for time series, one interesting paper I saw last year was based on using the “concentration phenomenon” to detect anomalies in high-dimensional datasets:

I wonder whether this idea could be extended to the embedding space of your transformer models? The paper also refers to previous work on anomaly detectors, including one-class SVMs that might also be applicable to your use case.

Another idea off the top of my head: could you apply the Local Outlier Factor method to the embedding space (or some lower-dimensional projection thereof)?
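To make the Local Outlier Factor idea concrete, here is a small from-scratch sketch on toy 2-D points that stand in for sentence embeddings. The data and `k=3` are assumptions, ties in the k-neighbourhood are ignored, and duplicate points are not handled; for real embeddings a library implementation (e.g. scikit-learn's `LocalOutlierFactor`) would be the more sensible choice:

```python
import math

def local_outlier_factor(points, k=3):
    """Return one LOF score per point; scores well above 1 suggest outliers.
    Simplified: exactly k neighbours per point, no duplicate points assumed."""
    n = len(points)
    neighbours = []  # indices of the k nearest neighbours of each point
    k_distance = []  # distance from each point to its k-th nearest neighbour
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours.append([j for _, j in ranked[:k]])
        k_distance.append(ranked[k - 1][0])

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbour j.
        return max(k_distance[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbours[i]) / k
        lrd.append(1.0 / mean_reach)

    # LOF: average density of the neighbours relative to the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

# Toy "embeddings": five points in a tight cluster plus one far-away point.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.15, 0.15), (5.0, 5.0)]
scores = local_outlier_factor(points, k=3)
print([round(s, 2) for s in scores])
```

In this toy set the isolated point receives by far the largest score, while the clustered points stay close to 1.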


Very interesting idea, I’m looking into it, thanks!


It’s been a while but this thread is closely related to an issue that I am running into regarding a research manuscript I currently have under review (for the 2nd time).

Like @MoritzLaurer, I am classifying statements that you would see on a personality test (e.g., “I enjoy going to parties.”, “I wear my emotions on my sleeve.”, etc.) into one of the Big Five personality traits: Neuroticism, Extraversion, Openness, Conscientiousness, and Agreeableness. However, the reviewers did not like the label set the first time around, given that there might be statements that represent non-personality traits, or personality traits beyond the five mentioned above. To deal with this in my second submission I added an “Other” category, which consisted of statements that have been proposed to measure non-Big Five traits (e.g., “I like to look at my body.”, “I bought something for my collection.”). The model did terribly when predicting the “Other” class because there was no consistency among the sentences I sampled. Additionally, some of the “Other” sentences do have a grounded relationship with one of the five target labels. The reviewers also weren’t fans of my label distribution for training and testing (i.e., “Other” made up 18% of my final samples). So here I am, asking for any suggestions I might be able to try.

  • When looking at the research, a multi-stage approach with a binary anomaly detection (@lewtun’s suggestion) followed by a multi-class classification seems feasible. I do have a question about the first model. In practice this would just be a binary classification problem where 5 target classes are treated as 1 Positive class, and the Negative class would be…what? All the sentences I can find that might belong to “Other”? Should this also involve synthetic data—where I randomly shuffle sentences observed in the Positive class?

  • Last (hopefully related) question. The reviewers suggested several ways for me to get a more agreed upon estimate of the out-of-sample/“Other” cases. They thought 18% was far too high for the context that this model would be applied to. Let’s say it’s estimated to be 5%. Is there a way I could take this into account (in terms of precision and recall) without training with an “Other” class and just using my 5 target classes? I just feel like the “Other” class is far too susceptible to sampling bias. I could select sequences of text that are totally unrelated to the use-case; however, they are still technically “Other”.
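One possible way to use an estimated “Other” rate without an explicit “Other” class is to combine it with the confidence-threshold idea from earlier in the thread: pick the cut-off so that roughly the expected share of inputs is rejected. The sketch below assumes that out-of-scope statements tend to receive the lowest top-class confidence, which is not guaranteed; the function name and the confidence values are made up for illustration:

```python
def cutoff_for_expected_rate(confidences, expected_other_rate=0.05):
    """Return a confidence cut-off such that about `expected_other_rate`
    of the given inputs fall at or below it (None if nothing is rejected)."""
    ranked = sorted(confidences)
    n_reject = round(expected_other_rate * len(ranked))
    if n_reject == 0:
        return None
    return ranked[n_reject - 1]

# Hypothetical top-class probabilities from a 5-class model on 20 unlabelled items.
confidences = [
    0.35, 0.62, 0.71, 0.80, 0.84, 0.88, 0.90, 0.91, 0.93, 0.94,
    0.95, 0.95, 0.96, 0.97, 0.97, 0.98, 0.98, 0.99, 0.99, 0.99,
]
cutoff = cutoff_for_expected_rate(confidences, expected_other_rate=0.05)
rejected = [c for c in confidences if cutoff is not None and c <= cutoff]
print(cutoff, len(rejected))
```

Precision and recall for the five target classes could then be reported at that cut-off, with the rejected items counted as “Other”. Whether the rejected items really are out-of-scope would still need to be checked on a labelled sample.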

Thank you all for being so helpful!
