Implementing custom tokenizer components (normalizers, processors)

saattrupdan · November 29, 2021, 6:06pm

I’m wondering if there is an easy way to tweak the individual components of a tokenizer. Specifically, I’d like to implement a custom normalizer and post-processor.

Just to provide some context, I’m trying to train a Danish tokenizer. Danish has a lot of compound nouns (e.g., the Danish translation of “house owner” is “husejer”, with “hus” being “house” and “ejer” being “owner”), so a tokenizer should split these accordingly. A standard BPE or WordPiece tokenizer can deal with this just fine.

The issue is that some compound nouns take a linking “s” between the two words. For instance, “birthday greeting” is “fødselsdagshilsen”, with “fødselsdag” being “birthday” and “hilsen” being “greeting”. This messes up the tokenizer completely: it produces [‘fødselsdag’, ‘shi’, ‘l’, ‘sen’] rather than the ideal [‘fødselsdag’, ‘s’, ‘hilsen’].

I think I can solve it by introducing a new special token, <conn>. At the normaliser stage, I check whether the word is of the form <word1>s<word2>, where <word1> and <word2> are known words, and if so, replace the “s” with <conn>. At the post-processing stage, I then replace the <conn> instances with “s” again.
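To make the idea concrete, here is a minimal sketch of the two steps in plain Python, independent of any tokenizer library. The names `mark_linking_s`, `restore_linking_s` and `known_words` are hypothetical and only for illustration. It assumes a set of known words is already available, and it simply takes the first split that works.

```python
CONN = "<conn>"  # the special token proposed above


def mark_linking_s(word, known_words):
    """Replace a linking "s" with CONN if both halves are known words."""
    # Try every interior "s" as a candidate linking letter.
    for i in range(1, len(word) - 1):
        if word[i] != "s":
            continue
        left, right = word[:i], word[i + 1:]
        if left in known_words and right in known_words:
            return left + CONN + right
    # No valid split found: leave the word untouched.
    return word


def restore_linking_s(tokens):
    """Turn every CONN token back into a plain "s"."""
    return ["s" if token == CONN else token for token in tokens]


# Hypothetical word list, just for the example.
known_words = {"fødselsdag", "hilsen", "hus", "ejer"}

marked = mark_linking_s("fødselsdagshilsen", known_words)
restored = restore_linking_s(["fødselsdag", CONN, "hilsen"])
```

With this word list, “husejer” is left unchanged, since it has no “s” that separates two known words.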

Long story short, is there a way to simply subclass the normaliser/processor classes to implement such behaviours?

saattrupdan · November 30, 2021, 12:41pm

For anyone else looking, this can be done, and it’s answered in this question:
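For readers who want a rough picture of what hooking in a custom component can look like, here is a sketch. It is not necessarily the approach from the linked answer. It relies on `Normalizer.custom` from the `tokenizers` Python bindings, which, to my knowledge, wraps any object that has a `normalize` method receiving a `NormalizedString`. The class `WordRewriteNormalizer` and the helper `build_normalizer` are hypothetical names.

```python
import re
from typing import Callable


class WordRewriteNormalizer:
    """Hypothetical custom normalizer that rewrites individual words."""

    def __init__(self, rewrite_word: Callable[[str], str]):
        # rewrite_word maps a word to its replacement (or returns it unchanged).
        self.rewrite_word = rewrite_word

    def normalize(self, normalized) -> None:
        # `normalized` is assumed to behave like tokenizers.NormalizedString:
        # str() gives the current text and replace() edits it in place.
        for word in set(re.findall(r"\w+", str(normalized))):
            rewritten = self.rewrite_word(word)
            if rewritten != word:
                # Plain substring replacement, a simplification for this
                # sketch: it would also hit the word inside longer words.
                normalized.replace(word, rewritten)


def build_normalizer(rewrite_word: Callable[[str], str]):
    # Imported lazily so the class above can be tried without the library.
    from tokenizers.normalizers import Normalizer

    return Normalizer.custom(WordRewriteNormalizer(rewrite_word))
```

Assuming `tokenizer` is a `tokenizers.Tokenizer`, the result could then be assigned with `tokenizer.normalizer = build_normalizer(my_rewrite_function)`, where `my_rewrite_function` is whatever word-level rule you want to apply.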

Related topics

Topic	Replies	Views	Activity
How to add additional custom pre-tokenization processing? 🤗Tokenizers	6	5183	March 7, 2023
Custom PostProcessor? 🤗Tokenizers	0	915	November 10, 2022
Modifying normalizer for pretrained tokenizers don't consistently work 🤗Tokenizers	2	118	June 12, 2024
Tokenizer post_processor help 🤗Tokenizers	1	1366	October 27, 2022
Adding atomic / indivisible tokens to BPE tokenizer 🤗Tokenizers	8	34	July 3, 2025