Annif - toolkit for multilabel text classification

juhoinkinen · April 25, 2024, 2:15pm

We are pleased to announce the release of Annif 1.1!

Annif is a multi-algorithm automated subject indexing tool intended for libraries, archives and museums. It suggest subjects or topics from a predefined vocabulary which can be a thesaurus, ontology or just a list of subjects. The number of the subjects in the vocabulary can be large, tens of thousands or even more, and thus the task Annif performs can be called extreme multilabel classification.

Annif uses more traditional machine learning techniques, not LLMs, which makes it very fast in inference: typically it gives subjects for a text corresponding to a PDF of tens of pages in less than one second. Annif has a CLI for administrative tasks and a REST API for end users. Its development started and continues at the National Library of Finland, but all are welcome to join in!

Regarding Hugging Face, Annif 1.1 introduced annif upload and annif download commands which can be used to push and pull a set of selected projects and vocabularies to and from a Hugging Face Hub repository.

Check out these resources:

juhoinkinen · April 25, 2024, 2:22pm

PS Maybe someone could forward my above message to the HF posts feed? I’m still in the waitlist for it, so can’t post there myself.

juhoinkinen · October 2, 2024, 11:16am

Annif 1.2 has been released!

This release introduces language detection capabilities in the REST API and CLI, improves Hugging Face Hub integration, and also includes the usual maintenance work and minor bug fixes.

The new REST API endpoint /v1/detect-language expects POST requests that contain a JSON object with the text whose language is to be analyzed and a list of candidate languages. Similarly, the CLI has a new command annif detect-language. Annif projects are typically language specific, so a text of a given language needs to be processed with a project intended for that language; the language detection feature can help in this. For details see this Wiki page. The language detection is performed with the Simplemma library by @adbar et al.

The annif download command has a new --trust-repo option, which needs to be used if the repository to download from has not been used previously (that is if the repository does not appear in the local Hugging Face Hub cache). This option is introduced to raise awareness of the risks of downloading projects from the internet; the project downloads should only be done from trusted sources. For more information see the Hugging Face Hub documentation.

This release also includes automation of downloading the NLTK datapackage used for tokenization to simplify Annif installation. Maintenance tasks include upgrading dependencies, including a new version of Simplemma that allows better control over memory usage. The bug fixes include restoring the --host option of the annif run command.

Python 3.12 is now fully supported (previously NN-ensemble and STWFSA backends were not supported on Python 3.12).

Supported Python versions:

3.9, 3.10, 3.11 and 3.12

Backward compatibility:

NN ensemble projects trained with Annif v1.1 or older need to be retrained.
For other projects, the warnings by SciKit-learn are harmless.

juhoinkinen · November 12, 2024, 8:51am

Have you been using Annif for subject indexing or classification? What do you think of it?

We’re interested in your feedback!

The Annif users survey is open until November 30: https://forms.gle/P7jGoPMbEAJnD9zw9

juhoinkinen · February 10, 2025, 9:54am

Annif 1.3 has been released!

This release introduces a new EstNLTK analyzer, improves the performance of the MLLM backend and fixes minor bugs.

The key enhancement of this release is the addition of a new analyzer for lemmatization using EstNLTK, which supports the Estonian language. This analyzer needs to be installed separately, see the Optional features and dependencies in Wiki. Note that the indirect dependencies of EstNLTK are quite large, requiring around 500 MB of libraries.

Another improvement is the optimization of the ambiguity feature calculation in the MLLM algorithm. Previously, the calculation could be slow, especially when dealing with a large number of matches when using a large vocabulary such as GND. This optimization addresses the quadratic nature of the ambiguity calculation, and is expected to greatly reduce the processing time of some documents.

This release also includes maintenance updates and bug fixes. The file permissions issue, where Annif did not adhere to the umask setting for data files, has been resolved, thus easing Annif use in multiuser environments.

Supported Python versions:

3.9, 3.10, 3.11, and 3.12

Backward compatibility:

The projects trained with Annif v1.2 remain working.

juhoinkinen · March 24, 2025, 10:11am

Last November we organized a survey for Annif users, and now the results have been published in the Doria repository of the National Library of Finland: https://www.doria.fi/bitstream/handle/10024/190930/Annif%20Users%20Survey.pdf

The report includes an overview of:

The vocabularies and datasets that are used with Annif
The workflows that Annif is integrated with
The problems Annif users are facing

The average ratings for various aspects and features of Annif given by users are shown. In short, in a scale from 1 to 5, the ratings are:

Overall: 4.4
Features and functions: 4.1
Documentation: 4.5
Smoothness of initial setup: 4.2
Usability: 4.4
Achieved quality of subject suggestions: 3.6

The survey also gathered user views on the improvements and new features, which are briefly discussed in the report.

juhoinkinen · May 8, 2025, 11:35am

We (@osma, @MonaLehtinen & me, i.e. the Annif team at the National Library of Finland) recently took part in the LLMs4Subjects challenge at the SemEval-2025 workshop. The task was to use large language models (LLMs) to generate good quality subject indexing for bibliographic records, i.e. titles and abstracts.

We are glad to report that our system performed well; it was ranked

1st in the category where the full vocabulary was used
2nd in the smaller vocabulary category
4th in the qualitative evaluations.

14 participating teams developed their own solutions for generating subject headings and the output of each system was assessed using both quantitative and qualitative evaluations. Research papers about most of the systems are going to be published around the time of the workshop in late July, and many pre-prints are already available.

We applied Annif together with several LLMs that we used to preprocess the data sets: translated the GND vocabulary terms to English, translated bibliographic records into English and German as required, and generated additional synthetic training data. After the preprocessing, we used the traditional machine learning algorithms in Annif as well as the experimental XTransformer algorithm that is based on language models. We also combined the subject suggestions generated using English and German language records in a novel way.

More information can be found in our system description preprint: Paper page - Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs

See also the task description preprint: Paper page - SemEval-2025 Task 5: LLMs4Subjects -- LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog

The Annif models trained for this task are available here:
NatLibFi/Annif-LLMs4Subjects-data

juhoinkinen · June 16, 2025, 8:28am

Recently we investigated how the relevance of keyword suggestions produced by Annif has evolved in the publication repositories NatLibFi, see the report (in Finnish). The results provide insights into the development and usefulness of AI-based subject indexing.

The plot shows the the scores averaged over months (dots) and between Annif model updates (horizontal lines): the subject suggestion quality has essentially remained constant since the third model update:

Topic		Replies	Views
Mutli-label classification for large free text input Beginners	2	48	December 10, 2024
German NLP Repository Languages at Hugging Face	11	4537	November 21, 2023
A new dataset for multi-label text classification Intermediate	1	1046	September 30, 2021
Text Classification With LLM Beginners	4	2674	January 13, 2025
ACL 2020 - Some personal highlights - Victor Research	4	1367	July 14, 2020

Annif - toolkit for multilabel text classification

Annif 1.2 has been released!

Supported Python versions:

Backward compatibility:

Related topics