Given class names, compose seed-word lists to help a text classifier discriminate between classes

Hi, everyone!

TL;DR: I need your guidance on which algorithm to use to compose, from class names alone, a list of 20-50 words for each class. These lists will then be fed into a GLDA model to help it discriminate between the classes. Also:

  1. There is a hierarchy of classes.
  2. Depending on the level in the hierarchy, the number of classes varies from 2 to 30.
  3. Class names consist of 2-3 meaningful words, such as: ‘Arts & Entertainment’.
  4. There is no labeled corpus of text to train keyword extraction.

A bit more context:
Earlier, I asked for a model that can classify texts into 1500 classes in a zero-shot setting and received a very insightful recommendation from @merve. Following it, I tried a number of MNLI-tuned models, such as facebook/bart-large-mnli, bert-mnli, roberta-mnli, and their distilled versions. The problem is that they are slow: BART needs around 2 minutes to classify one 30-word text, distilled BART needs a minute, and distilled BERT/RoBERTa need 20 seconds, which is still too slow. In my tests, the slower the model, the better its accuracy. In contrast, GLDA labels texts almost instantly, but with a large number of classes its accuracy is poor.
I also tried BERTopic, but the results were poor: it could not label a large share of the texts.

Then I came up with the idea of assigning a disjoint set of seed words to each class at each level of the hierarchy, and then using several GLDA models that operate on different levels. This may work because GLDA has good accuracy with fewer than 150 classes, and it should be faster than the NLI models.
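
To make the "disjoint seed sets per level" idea concrete, here is a minimal sketch in plain Python. The class names, the candidate words, and the helper names `make_disjoint` and `to_seed_topics` are all made up for illustration. The word-to-topic-index mapping at the end is only my assumption about the kind of input a seeded LDA implementation would take; the exact format depends on the library.

```python
from collections import Counter

def make_disjoint(seed_lists):
    """Drop every word that occurs in more than one class's seed list."""
    # Count in how many classes each word appears (set() ignores repeats within a class).
    counts = Counter(w for words in seed_lists.values() for w in set(words))
    return {
        label: [w for w in words if counts[w] == 1]
        for label, words in seed_lists.items()
    }

def to_seed_topics(seed_lists):
    """Map each seed word to the index of its class (hypothetical input format)."""
    return {
        word: topic_id
        for topic_id, words in enumerate(seed_lists.values())
        for word in words
    }

# Hypothetical candidate lists for the children of one parent class.
food_children = {
    "fruits": ["apple", "juice", "sweet", "fresh"],
    "vegetables": ["carrot", "fresh", "salad", "juice"],
}

disjoint = make_disjoint(food_children)
seed_topics = to_seed_topics(disjoint)
print(disjoint)
print(seed_topics)
```

One such dictionary would be built for every parent class, so that each GLDA model only has to separate siblings. Note that simply deleting shared words can leave a class with very few seeds, which is exactly why I need a better way to pick them.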

Now I need to obtain discriminative seed lists. At first, I thought about using word2vec embeddings to get the top n most relevant words for each class, but then I realized that the resulting lists are not discriminative enough at the lower levels of the hierarchy. Word2vec may help to distinguish texts about cars from texts about food, for instance, but it may fail on subcategories of food, such as fruits and vegetables: their embeddings are close in the embedding space, so the seed words for each class are drawn from overlapping neighbourhoods.
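
To illustrate the overlap problem and one possible workaround, here is a rough sketch that ranks candidate words by a margin (similarity to the class itself minus the highest similarity to any sibling class) instead of by raw similarity. The 3-dimensional vectors are toy values invented for the example, not real word2vec embeddings, and `contrastive_seeds` and `min_margin` are hypothetical names. I have not tested whether this is good enough on real data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_seeds(class_vecs, word_vecs, top_n, min_margin=0.05):
    """Per class, keep the top_n words whose margin over sibling classes is large enough."""
    seeds = {}
    for label, class_vec in class_vecs.items():
        scored = []
        for word, word_vec in word_vecs.items():
            own = cosine(word_vec, class_vec)
            # Best similarity to any *other* class at the same level.
            rival = max(
                cosine(word_vec, other_vec)
                for other, other_vec in class_vecs.items()
                if other != label
            )
            scored.append((own - rival, word))
        scored.sort(reverse=True)  # largest margin first
        seeds[label] = [w for margin, w in scored[:top_n] if margin >= min_margin]
    return seeds

# Toy vectors (assumed values, for illustration only).
class_vecs = {"fruits": [1.0, 0.2, 0.0], "vegetables": [0.2, 1.0, 0.0]}
word_vecs = {
    "apple": [0.9, 0.1, 0.1],
    "banana": [0.8, 0.3, 0.0],
    "carrot": [0.1, 0.9, 0.1],
    "salad": [0.5, 0.5, 0.2],  # equally close to both classes
}

seeds = contrastive_seeds(class_vecs, word_vecs, top_n=2)
print(seeds)
```

With plain top-n similarity, "salad" would be a candidate for both classes; with the margin it is dropped from both, so the resulting lists are disjoint by construction whenever `min_margin` is positive.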

I will be more than happy if you share links to models/articles or your thoughts. Thanks in advance. :hugs:
