I would like to add a few custom functions for pre-tokenization. For example, I would like to split numerical text from any non-numerical text.
E.g. "1000mg" would become ["1000", "mg"].
I am trying to figure out the proper way to do this with the Python bindings; I think it may be a bit tricky since they wrap the original Rust implementation.
Which I am guessing may be where I could potentially add some pre-tokenization functions, but it doesn't seem to return anything. I noticed that it's expecting an instance of the PreTokenizedString defined here
Which does seem to have some text-processing functions, but they don't seem to return anything. I am guessing that any additional rules need to be implemented in the original Rust version itself?
I am looking at the Rust pre-tokenizers code; it seems that I have to add any additional preprocessing code here.
Does this seem like the right track for adding additional preprocessing code?
If it makes a difference, what I am trying to do is train a brand-new tokenizer.
There are multiple ways to customize the pre-tokenization process:
Using existing components
The tokenizers library provides many different PreTokenizers that you can use, and even combine as you wish. There is a list of components in the official documentation.
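For example, you can chain a couple of the built-in components with Sequence (a quick sketch; the offsets in the comment are what I would expect, not copied from a run):

```python
from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Whitespace, Punctuation

# Chain several built-in pre-tokenizers; they are applied in order
pre = pre_tokenizers.Sequence([Whitespace(), Punctuation()])
print(pre.pre_tokenize_str("Hello, world!"))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```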
Using custom components written in Python
It is possible to customize some of the components (Normalizer, PreTokenizer, and Decoder) using Python code. This hasn't been documented yet, but you can find an example here. It lets you directly manipulate the NormalizedString or PreTokenizedString to normalize and pre-tokenize as you wish.
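Following that example, a custom pre-tokenizer for your use case might look roughly like this (the class name, the regex, and the helper method name are my own, so treat it as a sketch rather than the official pattern):

```python
import re
from typing import List

from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer


class DigitSplitPreTokenizer:
    """Split runs of digits from runs of non-digits, e.g. "1000mg" -> ["1000", "mg"]."""

    def _split_digits(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        pieces = []
        for match in re.finditer(r"\d+|\D+", str(normalized)):
            # Slicing a NormalizedString keeps the offsets aligned with the original text
            pieces.append(normalized[match.start():match.end()])
        return pieces

    def pre_tokenize(self, pretok: PreTokenizedString):
        # The methods on PreTokenizedString modify it in place, which is why
        # they don't return anything
        pretok.split(self._split_digits)


# Attach it to the tokenizer you are building:
# tokenizer.pre_tokenizer = PreTokenizer.custom(DigitSplitPreTokenizer())
```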
Now for the example you mentioned (i.e. "1000mg" would become ["1000", "mg"]), you can probably use the Digits PreTokenizer, which does exactly this.
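For instance (a quick sketch; the printed output is what I would expect):

```python
from tokenizers.pre_tokenizers import Digits

pre = Digits(individual_digits=False)
print(pre.pre_tokenize_str("1000mg"))
# [('1000', (0, 4)), ('mg', (4, 6))]
```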
If you didn't get a chance to familiarize yourself with the Getting started part of our documentation, I think you will love it, as it explains a bit more how to customize your tokenizer and gives concrete examples.
I didn't know the tokenizers library had official documentation; it doesn't seem to be listed on the GitHub or pip pages, and googling "huggingface tokenizers documentation" just gives links to the transformers library instead. It doesn't seem to be on the huggingface.co main page either.
I was able to create a custom PreTokenizer based on the example linked above, but I'm not able to save the tokenizer due to the exception "Custom PreTokenizer cannot be serialized". I'm wondering how to bypass this.
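One workaround I am considering (not sure if it is the intended approach) is to swap in a plain, serializable pre-tokenizer right before saving and re-attach the custom one after loading, along these lines (DigitSplitPreTokenizer being my own custom class):

```python
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import PreTokenizer, Whitespace

# Replace the custom pre-tokenizer with a serializable placeholder before saving
tokenizer.pre_tokenizer = Whitespace()
tokenizer.save("tokenizer.json")

# After loading, re-attach the custom pre-tokenizer by hand
loaded = Tokenizer.from_file("tokenizer.json")
loaded.pre_tokenizer = PreTokenizer.custom(DigitSplitPreTokenizer())
```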
Are there plans for this to become a documented part of the API? I notice that the CustomDecoder code no longer works (I believe the method name changed); it would be great to have a stable API for this stuff (although I get it's a pretty niche thing).