I would like to add a few custom functions for pre-tokenization. For example, I would like to split numerical text from any non-numerical text.
E.g. "1000mg" would become ["1000", "mg"].
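In plain Python terms, the behavior I'm after is just splitting on digit/non-digit boundaries:

```python
import re

# Split runs of digits from runs of non-digits
re.findall(r"\d+|\D+", "1000mg")  # -> ['1000', 'mg']
```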
I am trying to figure out the proper way to do this for the Python binding; I think it may be a bit tricky since it's a binding for the original Rust version.
I am looking at the pre-tokenizer's `pre_tokenize` function:
/huggingface/tokenizers/blob/2ccd16bf5c3dd97759d7bdf5229e2feeba314b4a/bindings/python/py_src/tokenizers/pre_tokenizers/__init__.pyi#L6
Which I am guessing may be where I could potentially add some pre-tokenization functions, but it doesn't seem to return anything. I noticed that it's expecting an instance of PreTokenizedString,
defined here
/huggingface/tokenizers/blob/2ccd16bf5c3dd97759d7bdf5229e2feeba314b4a/bindings/python/py_src/tokenizers/__init__.pyi#L55
Which does seem to have some text-processing functions, but they don't seem to return anything either. Am I right in guessing that any additional rules need to be implemented in the original Rust version itself?
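My current reading of the stubs is that these methods are meant to mutate the `PreTokenizedString` in place rather than return a new value, which would explain the missing return types, and would mean custom logic might be possible in pure Python after all. Assuming the bindings expose `PreTokenizer.custom` (I believe recent versions do), here is a sketch of what I have in mind; `NumberSplit` and `number_split` are just names I made up:

```python
import re
from typing import List

from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer


class NumberSplit:
    def number_split(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        # Slice the piece at every digit/non-digit boundary,
        # e.g. "1000mg" -> ["1000", "mg"]
        return [
            normalized[m.start() : m.end()]
            for m in re.finditer(r"\d+|\D+", str(normalized))
        ]

    def pre_tokenize(self, pretok: PreTokenizedString):
        # split() calls the function on every current piece and replaces
        # each piece with the returned sub-pieces, mutating pretok in place
        pretok.split(self.number_split)


custom_pre_tokenizer = PreTokenizer.custom(NumberSplit())
```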
I am also looking at the Rust pre-tokenizers code; it seems that I would have to add any additional preprocessing code here.
Does this seem like the right track for adding custom pre-tokenization?
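As an aside, in case it saves the Rust work: newer releases appear to ship a built-in `Digits` pre-tokenizer that does exactly this kind of split (I'm not sure whether it exists at the commit pinned above):

```python
from tokenizers.pre_tokenizers import Digits

pre_tok = Digits(individual_digits=False)
print(pre_tok.pre_tokenize_str("1000mg"))
# [('1000', (0, 4)), ('mg', (4, 6))]
```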
If it makes a difference, what I am trying to do is train a brand-new tokenizer.
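Concretely, the end goal is something like the sketch below, using the hypothetical `NumberSplit` class from above. I'm assuming a BPE model with `BpeTrainer` here, and the `train()` signature may vary by version:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import PreTokenizer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Attach the custom pre-tokenizer sketched above
tokenizer.pre_tokenizer = PreTokenizer.custom(NumberSplit())

trainer = BpeTrainer(special_tokens=["[UNK]"])
tokenizer.train(["corpus.txt"], trainer=trainer)  # corpus.txt is a placeholder
```

One caveat I've seen mentioned: tokenizers with custom Python components apparently can't be serialized, so the custom pre-tokenizer would need to be re-attached after loading a saved tokenizer.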