We have a collection of about 100,000 Danish articles whose topics were assigned by professionals. We would like to build a model that can suggest topics when new articles need to be cataloged.
I have looked into BERTopic, and I can see there is a guide on supervised topic modeling (https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html) and another guide on handling the situation where a document has more than one topic (https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html). I am unsure whether it is possible (or easy) to combine the two approaches.
Does anybody have experience with such a task? Is BERTopic a good framework here?
Look up the calculate_probabilities=True flag. It generates a probability for each topic per document, so you are doing both steps in one go.
Alternatively, you can consider a two-step process like the one outlined here: Topic modelling and multiclass text classification using transformers | by Paulo Yun Cha | Oct, 2023 | Medium
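Once you have the per-document probability matrix (from `topics, probs = BERTopic(calculate_probabilities=True).fit_transform(docs)`), suggesting several topics per article is just a top-k over each row. A minimal sketch, using a made-up probability row and a hypothetical `suggest_topics` helper:

```python
import numpy as np

# Hypothetical probability row for one document over 8 topics, in the shape
# BERTopic returns per document when calculate_probabilities=True.
probs = np.array([0.02, 0.31, 0.05, 0.22, 0.01, 0.18, 0.04, 0.17])

def suggest_topics(probs, k=3, threshold=0.1):
    """Return indices of the top-k topics whose probability clears a floor."""
    top = np.argsort(probs)[::-1][:k]          # indices sorted by descending probability
    return [int(i) for i in top if probs[i] >= threshold]

print(suggest_topics(probs))  # → [1, 3, 5]
```

The threshold keeps low-confidence topics out of the suggestion list, which matters when each article should get only 3-6 topics.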
Thanks, that’s certainly relevant and much appreciated.
I'm sorry I didn't state it clearly before, but each document in our training data has multiple topics assigned (3-6, no more than that), and I am not sure how to feed that into the model in either of the setups described. What am I missing?
How many topics do you have? If you don't have a large number of them (20+), you could consider a slightly different approach, like the one in this workbook.
or this one
Thanks again. I'll definitely have a look, but we have thousands of topics, so I am not sure this will work. It might serve as inspiration for new ideas, though!
What is the ratio of topics to documents? If it's 1k topics for 100k docs, perhaps you need to look at this as a hierarchical classification problem, or as a summarisation problem where you try to create a keyword summary for each doc.
Perhaps scikit-learn will work better for this. See the example here: Working With Text Data — scikit-learn 1.3.2 documentation
@noahts You could take a look at Annif.
If you have a vocabulary (a list of all the possible topics), then Annif could be readily applied to your task. Or maybe you can construct a vocabulary from all the topics already assigned to articles?
Annif is a more traditional ML tool for (extreme) multilabel classification, intended for libraries, archives and museums. The YSO vocabulary used on the demo page annif.org has over 30,000 concepts (i.e. topics).
The YSO vocabulary is the General Finnish Ontology, maintained by the National Library of Finland.
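For a sense of what adopting Annif involves: projects are declared in a `projects.cfg` file. A hypothetical entry for this Danish use case could look like the fragment below; the project name, vocabulary name, and backend choice are illustrative, so check the Annif documentation for the exact options your version supports:

```ini
# Hypothetical Annif project for Danish articles (names are illustrative).
[danish-topics]
name=Danish article topics
language=da
backend=tfidf
analyzer=snowball(danish)
vocab=danish-topic-vocab
limit=10
```

After loading the vocabulary and training on the 100k already-cataloged articles, Annif can return a ranked list of suggested topics with scores for each new article.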
(Yeah, as a new user at the forum I had a limit of max. 2 links per post.)