How to add additional custom pre-tokenization processing?

Hi @reSearch2vec

There are multiple ways to customize the pre-tokenization process:

  1. Using existing components
    The tokenizers library provides many different PreTokenizers that you can use, and even combine as you wish. There is a list of available components in the official documentation. (See the first sketch below this list for a concrete combination.)

  2. Using custom components written in Python
    It is possible to customize some of the components (Normalizer, PreTokenizer, and Decoder) using Python code. This hasn’t been documented yet, but you can find an example here. It lets you directly manipulate the NormalizedString or PreTokenizedString to normalize and pre-tokenize as you wish. (The second sketch below this list shows the general shape.)
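For the first option, here is a minimal sketch, assuming the `tokenizers` Python bindings: it chains two built-in pre-tokenizers with `Sequence`, so that text is first split on whitespace and punctuation marks are then isolated.

```python
from tokenizers.pre_tokenizers import Punctuation, Sequence, WhitespaceSplit

# WhitespaceSplit only splits on whitespace; Punctuation then isolates
# punctuation marks within the pieces it produced.
pre_tokenizer = Sequence([WhitespaceSplit(), Punctuation()])

print(pre_tokenizer.pre_tokenize_str("Hello, world!"))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```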

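For the second option, here is a sketch of a custom Python PreTokenizer, modeled on the example linked above. The `DigitLetterSplit` class and its `digit_split` method are hypothetical names for illustration; the required interface is a `pre_tokenize` method that receives a PreTokenizedString.

```python
import re
from typing import List

from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer


class DigitLetterSplit:
    """Hypothetical example: split at boundaries between digit runs and
    everything else, so '1000mg' becomes '1000' and 'mg'."""

    def digit_split(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        # Slice the NormalizedString (rather than building new strings)
        # so offsets stay aligned with the original text.
        return [
            normalized[m.start() : m.end()]
            for m in re.finditer(r"\d+|\D+", str(normalized))
        ]

    def pre_tokenize(self, pretok: PreTokenizedString):
        # Apply our splitting function to every current split.
        pretok.split(self.digit_split)


# Attach it to an existing tokenizer:
# tokenizer.pre_tokenizer = PreTokenizer.custom(DigitLetterSplit())
```

One caveat: a tokenizer that uses a custom Python component like this cannot be serialized with `save()`, so you need to re-attach the component each time you load the tokenizer.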
Now for the example you mentioned (i.e. ‘1000mg’ would become [‘1000’, ‘mg’]), you can probably use the Digits PreTokenizer, which does exactly this.
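A minimal sketch, again assuming the Python bindings:

```python
from tokenizers.pre_tokenizers import Digits

# individual_digits=False keeps digit runs together ('1000');
# True would split them into '1', '0', '0', '0'.
pre_tokenizer = Digits(individual_digits=False)

print(pre_tokenizer.pre_tokenize_str("1000mg"))
# [('1000', (0, 4)), ('mg', (4, 6))]
```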

If you didn’t get a chance to familiarize yourself with the Getting started part of our documentation, I think you will love it, as it explains in more detail how to customize your tokenizer and gives concrete examples.
