Theme Extraction from Text

I’m embarking on a project that involves creating a text classification model using Hugging Face’s transformers. The goal is to categorize a diverse dataset into a set of broad, predefined themes. Additionally, the model should be capable of suggesting new themes for entries that don’t fit into the existing categories.

I am not sure whether this counts as classification, since the number of classes could be huge, in the hundreds. And if I go with topic modelling, it may produce distinct themes even for similar text entries.

Please suggest how to approach this.


This looks more like a clustering problem. See for instance this page: Clustering — Sentence-Transformers documentation.
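
To make the idea a bit more concrete, here is a rough sketch of one way it could work: assign each entry to the closest predefined theme when it is similar enough, and group whatever is left over so each group can be reviewed as a candidate new theme. Everything here is an assumption for illustration: the function names (`assign_themes`, `group_leftovers`) are made up, the 0.7 threshold is arbitrary and would need tuning, and the tiny 3-dimensional vectors stand in for real sentence embeddings (for example, from a Sentence-Transformers model's `encode` method). The greedy grouping is a simple stand-in, not the algorithm used on the linked page.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def assign_themes(entry_vecs, theme_vecs, threshold=0.7):
    """Assign each entry to its most similar theme, or None if below threshold.

    Returns (assignments, leftovers), where leftovers holds the indices
    of entries that did not match any predefined theme.
    """
    assignments, leftovers = [], []
    for i, vec in enumerate(entry_vecs):
        best_name, best_sim = None, -1.0
        for name, tvec in theme_vecs.items():
            sim = cosine(vec, tvec)
            if sim > best_sim:
                best_name, best_sim = name, sim
        if best_sim >= threshold:
            assignments.append(best_name)
        else:
            assignments.append(None)
            leftovers.append(i)
    return assignments, leftovers

def group_leftovers(entry_vecs, leftovers, threshold=0.7):
    """Greedy grouping: join the first group whose seed entry is similar
    enough, otherwise start a new group (a candidate new theme)."""
    groups = []  # each group is a list of entry indices; groups[k][0] is its seed
    for i in leftovers:
        for group in groups:
            if cosine(entry_vecs[i], entry_vecs[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

# Toy stand-ins for embeddings (assumption: real ones come from an embedding model).
theme_vecs = {"sports": [1.0, 0.0, 0.0], "finance": [0.0, 1.0, 0.0]}
entry_vecs = [
    [0.9, 0.1, 0.0],   # close to "sports"
    [0.1, 0.95, 0.0],  # close to "finance"
    [0.0, 0.1, 0.9],   # fits neither theme
    [0.05, 0.0, 1.0],  # fits neither theme, but resembles the previous entry
]

assignments, leftovers = assign_themes(entry_vecs, theme_vecs)
new_theme_groups = group_leftovers(entry_vecs, leftovers)
print(assignments)
print(new_theme_groups)
```

With real data you would probably replace the greedy grouping with a proper clustering method and then label each leftover group by hand or from its most representative entries.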