I would like to add a few custom functions for pre-tokenization. For example, I would like to split numerical text from any non-numerical text.
`'1000mg'` would become `['1000', 'mg']`.
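To make the goal concrete, here is a plain-Python sketch of the splitting behavior I am after (stdlib `re` only; this is just an illustration of the rule, not the tokenizers library API):

```python
import re

def split_number_units(text):
    # Split maximal runs of digits from maximal runs of non-digits,
    # e.g. "1000mg" -> ["1000", "mg"]
    return re.findall(r"\d+|\D+", text)

print(split_number_units("1000mg"))  # -> ['1000', 'mg']
```

Ideally this rule would run as one pre-tokenization step alongside the existing ones.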
I am trying to figure out the proper way to do this for the Python binding; I think it may be a bit tricky since it's a binding for the original Rust version.
I am looking at the pretokenizer function, which I am guessing may be where I could potentially add some pre-tokenization functions, but it doesn't seem to return anything. I noticed that it's expecting an instance of the PreTokenizedString defined here, which does seem to have some text-processing functions, but they don't seem to return anything either. I am guessing that any additional rules need to be implemented in the original Rust version itself?
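To illustrate what I mean by "they don't seem to return anything": those methods look like they mutate the object's state in place rather than returning a new value. Here is a toy stdlib-only analogy of that pattern (a hypothetical class of my own, not the real PreTokenizedString API):

```python
import re

class ToySplitString:
    """Toy analogy of an in-place splitter: the pieces live in
    self.splits, and split() mutates them instead of returning."""

    def __init__(self, text):
        self.splits = [text]

    def split(self, pattern):
        # Re-split every current piece on `pattern`. Returns None;
        # the result is observed through self.splits afterwards.
        self.splits = [
            piece
            for current in self.splits
            for piece in re.findall(pattern, current)
        ]

s = ToySplitString("1000mg")
s.split(r"\d+|\D+")
print(s.splits)  # -> ['1000', 'mg']
```

If that in-place pattern is indeed how the binding works, maybe the return value isn't the issue and I just need to know where to hook my function in.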
I am looking at the Rust pretokenizers code, and it seems that I would have to add any additional preprocessing code there. Does this seem like the right track for adding additional preprocessing code?
If it makes a difference, what I am trying to do is train a brand-new tokenizer.