I’m wondering if there is an easy way to tweak the individual components of a tokenizer. Specifically, I’d like to implement a custom normalizer and post-processor.
Just to provide some context, I’m trying to train a Danish tokenizer. Danish has a lot of compound nouns (e.g., the Danish translation of “house owner” is “husejer”, with “hus” being “house” and “ejer” being “owner”), so a tokenizer should split these accordingly. A standard BPE or WordPiece tokenizer can deal with this just fine.
The issue is that for some compound nouns, we insert a linking “s” between the two words. For instance, “birthday greeting” is “fødselsdagshilsen”, with “fødselsdag” being “birthday” and “hilsen” being “greeting”. This throws the tokenizer off completely: it splits the word as [‘fødselsdag’, ‘shi’, ‘l’, ‘sen’] rather than the ideal [‘fødselsdag’, ‘s’, ‘hilsen’].
I think I can solve it by introducing a new special token, <conn>. At the normaliser stage I would check whether the word is of the form <word1>s<word2>, where <word1> and <word2> are known words, and if so, replace the “s” with <conn>. At the post-processing stage, I would then replace the <conn> instances with “s” again.
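To make the idea concrete, here is a rough sketch of the two steps in plain Python. It is not wired into any tokenizer library; the names `KNOWN_WORDS`, `mark_connective` and `restore_connective` are made up for illustration, and the tiny word list is only a stand-in for a real Danish lexicon.

```python
from typing import AbstractSet

CONN = "<conn>"

# Hypothetical toy word list; a real one would come from a Danish lexicon.
KNOWN_WORDS = frozenset({"fødselsdag", "hilsen", "hus", "ejer"})


def mark_connective(word: str, known_words: AbstractSet[str] = KNOWN_WORDS) -> str:
    """Normalisation step: replace a linking "s" between two known words with CONN."""
    if word in known_words:
        return word  # already a known word, nothing to split
    # Try every interior "s" as a candidate linking letter.
    for i in range(1, len(word) - 1):
        if word[i] == "s" and word[:i] in known_words and word[i + 1:] in known_words:
            return word[:i] + CONN + word[i + 1:]
    return word  # no <word1>s<word2> split found, leave unchanged


def restore_connective(text: str) -> str:
    """Post-processing step: turn CONN back into the original "s"."""
    return text.replace(CONN, "s")


marked = mark_connective("fødselsdagshilsen")
print(marked)                      # fødselsdag<conn>hilsen
print(restore_connective(marked))  # fødselsdagshilsen
print(mark_connective("husejer"))  # husejer (no linking "s", left as-is)
```

This only handles a single linking “s” per word and returns the first split it finds, so compounds with three or more parts, or words with several valid splits, would need more care.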
Long story short, is there a way to simply subclass the normaliser/processor classes to implement such behaviours?