We have a collection of about 100,000 Danish articles whose topics were assigned by professionals. We would like to build a model that can suggest topics when new articles need to be cataloged.
I have looked into BERTopic, and I can see there is a guide on supervised topic modeling (https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html) and another guide on handling the situation where a document has more than one topic (https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html). I am unsure whether it is possible (or easy) to combine the two approaches.
Does anybody have experience with such a task? Is BERTopic a good framework here?
Look up the calculate_probabilities=True flag. It generates a probability for each topic per document, so you are doing both steps in one go.
Alternatively, you can consider a two-step process like the one outlined here: Topic modelling and multiclass text classification using transformers | by Paulo Yun Cha | Oct, 2023 | Medium
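Once you have the per-document probability matrix (from `topics, probs = BERTopic(calculate_probabilities=True).fit_transform(docs)`), suggesting several topics per article is just a top-k over each row. A minimal sketch, using a made-up probability row and a hypothetical `suggest_topics` helper:

```python
import numpy as np

# Hypothetical probability row for one document over 8 topics, in the shape
# BERTopic returns per document when calculate_probabilities=True.
probs = np.array([0.02, 0.31, 0.05, 0.22, 0.01, 0.18, 0.04, 0.17])

def suggest_topics(probs, k=3, threshold=0.1):
    """Return indices of the top-k topics whose probability clears a floor."""
    top = np.argsort(probs)[::-1][:k]          # indices sorted by descending probability
    return [int(i) for i in top if probs[i] >= threshold]

print(suggest_topics(probs))  # → [1, 3, 5]
```

The threshold keeps low-confidence topics out of the suggestion list, which matters when each article should get only 3-6 topics.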
Thanks, that’s certainly relevant and much appreciated.
I'm sorry I didn't state it clearly before, but each document in our training data has multiple topics assigned (3-6, no more than that), and I am not sure how to feed that into the model in either of the setups described. What am I missing?
How many topics do you have? If you don't have a large number of them (20+), you could consider a slightly different approach, like the one in this workbook.
or this one
Thanks again. I'll definitely have a look, but we have thousands of topics, so I am not sure this will work. It might serve as inspiration for new ideas, though!
What is the ratio of topics to documents? If it's 1k topics for 100k docs, perhaps you need to look at this as a hierarchical classification problem, or as a summarisation problem where you try to create a keyword summary for each doc.
Perhaps scikit-learn will work better for this. See the example here: Working With Text Data — scikit-learn 1.3.2 documentation
@noahts You could take a look at Annif.
If you have a vocabulary (a list of all the possible topics), then Annif could be readily applied to your task. Or maybe you can construct a vocabulary from all the topics already assigned to articles?
Annif is a more traditional ML tool for (extreme) multilabel classification, intended for libraries, archives and museums. The YSO vocabulary used on the demo page annif.org has over 30,000 concepts (i.e. topics).
The YSO vocabulary is the General Finnish Ontology, maintained by the National Library of Finland.
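For a sense of what adopting Annif involves: projects are declared in a `projects.cfg` file. A hypothetical entry for this Danish use case could look like the fragment below; the project name, vocabulary name, and backend choice are illustrative, so check the Annif documentation for the exact options your version supports:

```ini
# Hypothetical Annif project for Danish articles (names are illustrative).
[danish-topics]
name=Danish article topics
language=da
backend=tfidf
analyzer=snowball(danish)
vocab=danish-topic-vocab
limit=10
```

After loading the vocabulary and training on the 100k already-cataloged articles, Annif can return a ranked list of suggested topics with scores for each new article.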
(Yeah, as a new user at the forum I had a limit of max. 2 links per post.)