How to add additional custom pre-tokenization processing?

Hi @reSearch2vec

There are multiple ways to customize the pre-tokenization process:

  1. Using existing components
    The tokenizers library provides many different PreTokenizers that you can use, and even combine as you wish. There is a list of available components in the official documentation. (See the first sketch below this list for a concrete combination.)

  2. Using custom components written in Python
    It is possible to customize some of the components (Normalizer, PreTokenizer, and Decoder) using Python code. This hasn’t been documented yet, but you can find an example here. It lets you directly manipulate the NormalizedString or PreTokenizedString to normalize and pre-tokenize as you wish. (The second sketch below this list shows the general shape.)
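For the first option, here is a minimal sketch, assuming the `tokenizers` Python bindings: it chains two built-in pre-tokenizers with `Sequence`, so that text is first split on whitespace and punctuation marks are then isolated.

```python
from tokenizers.pre_tokenizers import Punctuation, Sequence, WhitespaceSplit

# WhitespaceSplit only splits on whitespace; Punctuation then isolates
# punctuation marks within the pieces it produced.
pre_tokenizer = Sequence([WhitespaceSplit(), Punctuation()])

print(pre_tokenizer.pre_tokenize_str("Hello, world!"))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```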

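For the second option, here is a sketch of a custom Python PreTokenizer, modeled on the example linked above. The `DigitLetterSplit` class and its `digit_split` method are hypothetical names for illustration; the required interface is a `pre_tokenize` method that receives a PreTokenizedString.

```python
import re
from typing import List

from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer


class DigitLetterSplit:
    """Hypothetical example: split at boundaries between digit runs and
    everything else, so '1000mg' becomes '1000' and 'mg'."""

    def digit_split(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        # Slice the NormalizedString (rather than building new strings)
        # so offsets stay aligned with the original text.
        return [
            normalized[m.start() : m.end()]
            for m in re.finditer(r"\d+|\D+", str(normalized))
        ]

    def pre_tokenize(self, pretok: PreTokenizedString):
        # Apply our splitting function to every current split.
        pretok.split(self.digit_split)


# Attach it to an existing tokenizer:
# tokenizer.pre_tokenizer = PreTokenizer.custom(DigitLetterSplit())
```

One caveat: a tokenizer that uses a custom Python component like this cannot be serialized with `save()`, so you need to re-attach the component each time you load the tokenizer.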
Now for the example you mentioned (i.e. ‘1000mg’ would become [‘1000’, ‘mg’]), you can probably use the Digits PreTokenizer, which does exactly this.
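A minimal sketch, again assuming the Python bindings:

```python
from tokenizers.pre_tokenizers import Digits

# individual_digits=False keeps digit runs together ('1000');
# True would split them into '1', '0', '0', '0'.
pre_tokenizer = Digits(individual_digits=False)

print(pre_tokenizer.pre_tokenize_str("1000mg"))
# [('1000', (0, 4)), ('mg', (4, 6))]
```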

If you didn’t get a chance to familiarize yourself with the Getting started part of our documentation, I think you will love it, as it explains in more detail how to customize your tokenizer and gives concrete examples.
