There are multiple ways to customize the pre-tokenization process:

- **Using existing components**: The `tokenizers` library provides many different `PreTokenizer`s that you can use, and even combine as you wish. There is a list of components in the official documentation.
- **Using custom components written in Python**: It is possible to customize some of the components (`Normalizer`, `PreTokenizer`, and `Decoder`) using Python code. This hasn't been documented yet, but you can find an example here. It lets you directly manipulate the `NormalizedString` or `PreTokenizedString` to normalize and pre-tokenize as you wish (see the sketch just after this list).
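For the custom-component route, here is a minimal sketch (the class name and the hyphen-splitting rule are just illustrative assumptions; `PreTokenizer.custom` and `PreTokenizedString.split` are the parts provided by the library's Python bindings):

```python
from typing import List

from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer


class HyphenPreTokenizer:
    """Illustrative custom component that splits each piece on hyphens."""

    def hyphen_split(
        self, i: int, normalized_string: NormalizedString
    ) -> List[NormalizedString]:
        # Slice the NormalizedString directly so that offsets into the
        # original text keep being tracked for you.
        text = str(normalized_string)
        splits, start = [], 0
        for idx, ch in enumerate(text):
            if ch == "-":
                if idx > start:
                    splits.append(normalized_string[start:idx])
                start = idx + 1
        if start < len(text):
            splits.append(normalized_string[start:])
        return splits or [normalized_string]

    def pre_tokenize(self, pretok: PreTokenizedString):
        # Apply the splitting function to every piece held by the
        # PreTokenizedString.
        pretok.split(self.hyphen_split)


# Wrap it so it can be assigned to a Tokenizer:
# tokenizer.pre_tokenizer = PreTokenizer.custom(HyphenPreTokenizer())
```

Note that a tokenizer using a custom Python component cannot be serialized with the usual save mechanism, so keep a way to rebuild it in code.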
Now for the example you mentioned (i.e. `'1000mg'` would become `['1000', 'mg']`), you can probably use the `Digits` `PreTokenizer`, which does exactly this.
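For example (a quick sketch; combining `Digits` with `Whitespace` in a `Sequence` is just one typical way to wire it up, not the only one):

```python
from tokenizers.pre_tokenizers import Digits, Sequence, Whitespace

# individual_digits=False keeps "1000" together instead of
# splitting it into "1", "0", "0", "0"
pre_tokenizer = Sequence([Whitespace(), Digits(individual_digits=False)])

print(pre_tokenizer.pre_tokenize_str("take 1000mg daily"))
# [('take', (0, 4)), ('1000', (5, 9)), ('mg', (9, 11)), ('daily', (12, 17))]
```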
If you didn't get a chance to familiarize yourself with the Getting started part of our documentation, I think you will love it, as it explains a bit more how to customize your tokenizer and gives concrete examples.