Are there any techniques for telling a neural network (NN) which classes to categorize texts into when there is no labeled training set, or only a very small one (1-2 instances per class)? For example, is it possible to provide some keywords for each class so that the NN clusters texts accordingly? Otherwise, the NN produces classes that are of no interest to me.
My task at hand is to classify texts into 1500 predefined categories. I was able to do it with Guided Latent Dirichlet Allocation (GLDA), but I believe I can achieve better results with an NN.
I would be more than happy if you shared links to models or articles, or your own thoughts. Thanks in advance.
I think you could use zero-shot text classification models to label your data, though I don't know whether 1500 categories is too many. Another idea: I recently came across this blog post on using BERT for topic modelling (it's like an extension of using embeddings for topic modelling). The author of the blog post maintains a package called BERTopic, which you might be able to use. It's based on transformers.
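To make the keyword idea concrete, here is a minimal sketch of how label assignment by similarity could work: each class is described by a few seed keywords, and a text gets the class whose description is closest to it. The `embed` function is a toy bag-of-words stand-in, used here only so the example runs without downloads; in practice it would be replaced by a pretrained sentence encoder or a zero-shot model. The class names and keywords are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, class_keywords):
    """Return (label, score) for the class whose keywords best match the text."""
    vec = embed(text)
    scores = {
        label: cosine(vec, embed(" ".join(words)))
        for label, words in class_keywords.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical classes and seed keywords (assumption, not from the thread).
class_keywords = {
    "sports": ["football", "match", "team", "goal"],
    "finance": ["bank", "stock", "market", "loan"],
}

label, score = classify("The team scored a late goal to win the match", class_keywords)
print(label, round(score, 3))
```

With 1500 categories the same loop still applies, but the quality depends almost entirely on the encoder, so a low best score is a reasonable signal to send a text for manual review rather than trusting the label.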
Merve, thanks for your suggestions! It took me some time to check whether those models are applicable to my task, and I am now certain that I will give them a try.
In addition, I am considering two other methods:
- Fine-tune a sentiment-analysis BERT model so that it predicts my classes instead. I doubt it will give good results, though, since I have far more classes than the emotional states it operates with.
- Try active learning, with some smart scheme to make it easier for assessors to assign one of the 1.5k labels to a text. This option is costly.
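One possible shape for the active-learning option, as a rough sketch: rank unlabeled texts by the entropy of the current model's predicted class probabilities, send the most uncertain ones to assessors first, and show each assessor only the model's top few candidate labels instead of all 1.5k. The function names, the `predictions` structure, and the example probabilities are all assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy of a {label: probability} distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def select_for_labeling(predictions, batch_size, top_k=3):
    """Pick the most uncertain texts and a label shortlist for each.

    predictions: {text_id: {label: probability}} from the current model.
    Returns a list of (text_id, [candidate labels, most likely first]).
    """
    ranked = sorted(predictions, key=lambda t: entropy(predictions[t]), reverse=True)
    queue = []
    for text_id in ranked[:batch_size]:
        probs = predictions[text_id]
        shortlist = sorted(probs, key=probs.get, reverse=True)[:top_k]
        queue.append((text_id, shortlist))
    return queue


# Hypothetical model outputs for three unlabeled documents.
predictions = {
    "doc1": {"sports": 0.90, "finance": 0.05, "politics": 0.05},
    "doc2": {"sports": 0.40, "finance": 0.35, "politics": 0.25},
    "doc3": {"sports": 0.10, "finance": 0.80, "politics": 0.10},
}

queue = select_for_labeling(predictions, batch_size=1, top_k=2)
print(queue)
```

The shortlist is what reduces the cost per annotation: the assessor confirms or corrects one of a handful of suggestions, and only falls back to searching the full label set when none of them fits.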
If you don't have enough labeled data, you can first try data augmentation. You could also consider self-supervised learning. It may help.
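As a minimal sketch of what text augmentation could look like, the snippet below creates noisy copies of a labeled example by randomly dropping words and swapping two word positions. This is only one simple family of augmentations, chosen here because it needs no external resources; the function name, parameters, and example sentence are assumptions, and it presumes that such small perturbations do not change the class of the text.

```python
import random


def augment(text, n_variants=3, p_delete=0.1, seed=0):
    """Create noisy copies of a text by random word deletion and one random swap."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    words = text.split()
    variants = []
    for _ in range(n_variants):
        # Drop each word with probability p_delete; keep all words if none survive.
        kept = [w for w in words if rng.random() > p_delete] or words[:]
        # Swap two random positions to perturb word order slightly.
        if len(kept) > 1:
            i, j = rng.sample(range(len(kept)), 2)
            kept[i], kept[j] = kept[j], kept[i]
        variants.append(" ".join(kept))
    return variants


original = "the central bank raised interest rates again"
variants = augment(original, n_variants=4)
for v in variants:
    print(v)
```

Each variant inherits the label of the original text, so 1-2 examples per class can be stretched into a slightly larger, noisier training set.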