I'm wondering if there is an easy way to tweak the individual components of a tokenizer. Specifically, I'd like to implement a custom normalizer and post-processor.
Just to provide some context, I'm trying to train a Danish tokenizer. Danish has a lot of compound nouns (e.g., the Danish translation of "house owner" is "husejer", with "hus" being "house" and "ejer" being "owner"), so a tokenizer should split these accordingly. A standard BPE or WordPiece tokenizer can deal with this just fine.
The issue is that for some compound nouns, we insert a linking "s" between the two words. For instance, "birthday greeting" is "fødselsdagshilsen", with "fødselsdag" being "birthday" and "hilsen" being "greeting". This messes up the tokenizer completely: it tokenizes the word as ["fødselsdag", "shi", "l", "sen"] rather than the ideal ["fødselsdag", "s", "hilsen"].
I think I can solve it by introducing a new special token, <conn>: at the normalisation stage, I check whether the word is of the form <word1>s<word2>, where <word1> and <word2> are known words, and if so, replace the "s" with <conn>. At the post-processing stage, I then replace the <conn> instances with "s" again.
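To make the idea concrete, here is a minimal, library-independent sketch of the two steps (the toy vocabulary, the <conn> marker string, and the function names are just for illustration, not part of any tokenizer API):

```python
# Hypothetical sketch of the connector idea: split a compound at a
# linking "s" when both halves are known words, and restore the "s"
# afterwards. KNOWN_WORDS stands in for the tokenizer's vocabulary.

KNOWN_WORDS = {"hus", "ejer", "fødselsdag", "hilsen"}
CONN = "<conn>"

def normalize(word: str) -> str:
    """Replace a linking 's' with CONN if both sides are known words."""
    for i in range(1, len(word) - 1):
        if word[i] == "s":
            left, right = word[:i], word[i + 1:]
            if left in KNOWN_WORDS and right in KNOWN_WORDS:
                return left + CONN + right
    return word  # no valid split found; leave the word untouched

def postprocess(text: str) -> str:
    """Restore every CONN marker back to the surface form 's'."""
    return text.replace(CONN, "s")

print(normalize("fødselsdagshilsen"))  # fødselsdag<conn>hilsen
print(postprocess(normalize("fødselsdagshilsen")))  # fødselsdagshilsen
```

In a real tokenizer these two functions would live in the normalizer and post-processor components, with <conn> registered as a special token so it survives the model stage intact.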
Long story short: is there a way to simply subclass the normalizer and post-processor classes to implement this behaviour?