Hello -
We have a collection of about 100,000 Danish articles that have topics assigned to them by professionals. We would like to build a model that can help them with suggestions when new articles need to be cataloged with topics.
Thanks, that’s certainly relevant and much appreciated.
I’m sorry I didn’t state it clearly before, but the setup is that each of the documents in our training data has multiple topics assigned (3-6, not more than that), and I am not sure how to feed that into the model in either of the setups described. What am I missing?
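For what it's worth, one common way to represent several topics per document is a multi-hot target vector: one slot per known topic, set to 1 for each topic assigned to the article. The sketch below uses made-up articles and topic names, and the helpers `build_label_index` and `multi_hot` are hypothetical names for illustration, not part of any particular library. Whether a given model setup accepts targets in this shape is an assumption you would need to check.

```python
# Hypothetical toy data: (article text, list of assigned topics).
articles = [
    ("Article about wind farms", ["energy", "climate", "denmark"]),
    ("Article about school reform", ["education", "politics", "denmark"]),
]


def build_label_index(articles):
    """Map every topic seen in the data to a fixed column position."""
    labels = sorted({topic for _, topics in articles for topic in topics})
    return {label: i for i, label in enumerate(labels)}


def multi_hot(topics, label_index):
    """Return a 0/1 vector with a 1 in the column of each assigned topic."""
    vec = [0] * len(label_index)
    for topic in topics:
        vec[label_index[topic]] = 1
    return vec


label_index = build_label_index(articles)
targets = [multi_hot(topics, label_index) for _, topics in articles]
```

With thousands of topics these vectors get long and mostly zero, so a sparse representation is usually preferable in practice.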
Thanks again. I’ll definitely have a look at it, but we have thousands of topics, so I am not sure if this will work. But it might be used as inspiration for new ideas!
What is the ratio of topics to documents? If it is around 1k topics for 100k docs, you may need to treat this as a hierarchical classification problem. Or is it a summarisation problem, where you are trying to create a keyword summary for each doc?
If you have a vocabulary (a list of all the possible topics), then Annif could be readily applied to your task. Or maybe you can construct a vocabulary from all the topics already assigned to articles?
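As a rough sketch of that second option, the vocabulary can be collected by counting every topic that appears in the existing assignments. The data here is invented, and `build_vocabulary` and its `min_count` cutoff are hypothetical; the lowercasing and whitespace stripping are an assumption about how the assigned topics should be normalised. The resulting list would still need to be converted into whatever vocabulary format the tool expects.

```python
from collections import Counter

# Hypothetical toy data: the topics assigned to each existing article.
assigned_topics = [
    ["energy", "climate", "denmark"],
    ["education", "politics", "denmark"],
    ["climate", "politics", "energy"],
]


def build_vocabulary(assigned_topics, min_count=1):
    """Collect distinct topics, keeping those used at least min_count times."""
    counts = Counter(
        topic.strip().lower()  # assumed normalisation of topic strings
        for topics in assigned_topics
        for topic in topics
    )
    return sorted(topic for topic, n in counts.items() if n >= min_count)


vocabulary = build_vocabulary(assigned_topics)
frequent = build_vocabulary(assigned_topics, min_count=2)
```

Dividing `len(vocabulary)` by the number of articles also gives the topic-to-document ratio, and the `min_count` cutoff shows how many topics have enough examples to learn from.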
Annif is a more traditional ML tool for (extreme) multilabel classification intended for libraries, archives and museums. The YSO vocabulary in use at the demo page annif.org has over 30,000 concepts (i.e. topics).