How to add additional custom pre-tokenization processing?

I would like to add a few custom functions for pre-tokenization. For example, I would like to split numerical text from any non-numerical text.

E.g.

‘1000mg’ would become [‘1000’, ‘mg’].

I am trying to figure out the proper way to do this with the Python binding; I think it may be a bit tricky since it's a binding for the original Rust version.

I am looking at the pretokenizer function

/huggingface/tokenizers/blob/2ccd16bf5c3dd97759d7bdf5229e2feeba314b4a/bindings/python/py_src/tokenizers/pre_tokenizers/init.pyi#L6

Which I am guessing may be where I could potentially add some pre-tokenization functions, but it doesn’t seem to return anything. I noticed that it’s expecting an instance of the PreTokenizedString defined here

/huggingface/tokenizers/blob/2ccd16bf5c3dd97759d7bdf5229e2feeba314b4a/bindings/python/py_src/tokenizers/init.pyi#L55

Which does seem to have some text processing functions. But they don’t seem to return anything. I am guessing that any additional rules need to be implemented in the original rust version itself?

I am looking at the Rust pre-tokenizers code, and it seems that I would have to add any additional preprocessing code here

Does this seem like the right track for adding additional preprocessing code?

If it makes a difference, what I am trying to do is train a brand new tokenizer.

Hi @reSearch2vec

There are multiple ways to customize the pre-tokenization process:

  1. Using existing components
    The tokenizers library provides many different PreTokenizers that you can use, and even combine as you wish. There is a list of components in the official documentation.

  2. Using custom components written in Python
    It is possible to customize some of the components (Normalizer, PreTokenizer, and Decoder) using Python code. This hasn’t been documented yet, but you can find an example here. It lets you directly manipulate the NormalizedString or PreTokenizedString to normalize and pre-tokenize as you wish.
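To give an idea of what option 2 can look like, here is a minimal sketch of a custom pre-tokenizer written in Python. The class name `DigitSplitPreTokenizer` and the regex are my own illustration, not something shipped with the library; the `PreTokenizer.custom(...)` wrapper and the `pretok.split(...)` callback follow the pattern used in the repository's custom components example.

```python
import re
from typing import List

from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer

# Matches either a run of digits or a run of non-digit characters.
_DIGIT_RUNS = re.compile(r"\d+|\D+")


class DigitSplitPreTokenizer:
    """Hypothetical custom component: separates digit runs from other text."""

    def _split(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        # Slice the NormalizedString itself (instead of building new str
        # objects) so the library can keep track of the original offsets.
        text = str(normalized)
        return [normalized[m.start():m.end()] for m in _DIGIT_RUNS.finditer(text)]

    def pre_tokenize(self, pretok: PreTokenizedString) -> None:
        # split() modifies the PreTokenizedString in place, which is why
        # nothing is returned from this method.
        pretok.split(self._split)


custom = PreTokenizer.custom(DigitSplitPreTokenizer())
pieces = custom.pre_tokenize_str("1000mg")
print([piece for piece, _offsets in pieces])
```

You would then assign it with `tokenizer.pre_tokenizer = custom`. One caveat, as far as I know: a tokenizer that contains a custom Python component cannot be serialized with `save`, so you may need to swap in a built-in component before saving.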

Now for the example you mentioned (i.e. ‘1000mg’ would become [‘1000’, ‘mg’]), you can probably use the Digits PreTokenizer, which does exactly this.
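As a quick sketch of the Digits approach (the sample strings are just for illustration), you can check the behaviour with `pre_tokenize_str` before training anything. The second part assumes you also want to split on whitespace first, which is a choice on my side and not a requirement:

```python
from tokenizers.pre_tokenizers import Digits, Sequence, Whitespace

# Digits isolates runs of digits from the surrounding text.
# individual_digits=False keeps "1000" as one piece instead of "1", "0", "0", "0".
digits = Digits(individual_digits=False)
digits_only = digits.pre_tokenize_str("1000mg")
print(digits_only)

# Built-in components can be chained: split into words first, then split digits.
combined = Sequence([Whitespace(), Digits(individual_digits=False)])
combined_result = combined.pre_tokenize_str("take 1000mg daily")
print(combined_result)
```

Each result is a list of `(piece, (start, end))` tuples. Once you are happy with the splits, set `tokenizer.pre_tokenizer = combined` on your `Tokenizer` before training.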

If you didn’t get a chance to familiarize yourself with the Getting started part of our documentation, I think you will love it as it explains a bit more how to customize your tokenizer, and gives concrete examples.


Thanks Anthony! A lot of great info.

I didn’t know the tokenizers library had official documentation. It doesn’t seem to be listed on the GitHub or pip pages, and googling ‘huggingface tokenizers documentation’ just gives links to the transformers library instead. It doesn’t seem to be on the huggingface.co main page either.

Very much looking forward to reading it.
