How to add additional custom pre-tokenization processing?

reSearch2vec · October 19, 2020, 9:14pm

I would like to add a few custom functions for pre-tokenization. For example, I would like to split numerical text from any non-numerical text.

E.g.

‘1000mg’ would become [‘1000’, ‘mg’].

I am trying to figure out the proper way to do this for the Python binding; I think it may be a bit tricky since it’s a binding for the original Rust version.

I am looking at the pretokenizer function

/huggingface/tokenizers/blob/2ccd16bf5c3dd97759d7bdf5229e2feeba314b4a/bindings/python/py_src/tokenizers/pre_tokenizers/__init__.pyi#L6

Which I am guessing may be where I could potentially add some pretokenization functions, but it doesn’t seem to return anything. I noticed that it’s expecting an instance of the PreTokenizedString defined here

/huggingface/tokenizers/blob/2ccd16bf5c3dd97759d7bdf5229e2feeba314b4a/bindings/python/py_src/tokenizers/__init__.pyi#L55

Which does seem to have some text processing functions, but they don’t seem to return anything. I am guessing that any additional rules need to be implemented in the original Rust version itself?

I am looking at the Rust pre-tokenizer code, and it seems that I would have to add any additional preprocessing code here:

huggingface/tokenizers/blob/master/tokenizers/src/pre_tokenizers/unicode_scripts/pre_tokenizer.rs

use crate::pre_tokenizers::unicode_scripts::scripts::{get_script, Script};
use crate::tokenizer::{normalizer::Range, PreTokenizedString, PreTokenizer, Result};

#[derive(Clone, Debug)]
pub struct UnicodeScripts;
impl_serde_unit_struct!(UnicodeScriptsVisitor, UnicodeScripts);

impl UnicodeScripts {
    pub fn new() -> Self {
        Self {}
    }
}

impl Default for UnicodeScripts {
    fn default() -> Self {
        Self::new()
    }
}

// This code exists in the Unigram default IsValidSentencePiece.

(This file has been truncated.)

Does this seem like the right track for adding additional preprocessing code?

If it makes a difference, what I am trying to do is train a brand new tokenizer.

anthony · October 20, 2020, 4:44pm

Hi @reSearch2vec

There are multiple ways to customize the pre-tokenization process:

1. Using existing components
The tokenizers library provides many different PreTokenizers that you can use, and even combine as you wish. There is a list of components in the official documentation.
2. Using custom components written in Python
It is possible to customize some of the components (Normalizer, PreTokenizer, and Decoder) using Python code. This hasn’t been documented yet, but you can find an example here. It lets you directly manipulate the NormalizedString or PreTokenizedString to normalize and pre-tokenize as you wish.
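
As a rough illustration of the custom-component route, here is a minimal sketch that follows the pattern used in the library’s custom components example: a plain Python object exposing a pre_tokenize method, wrapped with PreTokenizer.custom. The class name DigitSplitter and its helper _split are made up for this illustration, and because this interface is undocumented it may differ between versions.

```python
import re
from typing import List

from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer

# Matches either a run of digits or a run of non-digits.
_DIGIT_RUNS = re.compile(r"\d+|\D+")


class DigitSplitter:
    """Hypothetical custom pre-tokenizer: separates digit runs from other text."""

    def _split(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        # Slicing a NormalizedString keeps its alignment with the original text.
        return [
            normalized[m.start():m.end()]
            for m in _DIGIT_RUNS.finditer(str(normalized))
        ]

    def pre_tokenize(self, pretok: PreTokenizedString) -> None:
        # split() applies the callback to every current piece, in place.
        pretok.split(self._split)


custom = PreTokenizer.custom(DigitSplitter())
pieces = [text for text, _offsets in custom.pre_tokenize_str("1000mg")]
print(pieces)
```

Note that this sketch does not split on whitespace, so a run of non-digits such as "take " would keep its trailing space; it is only meant to show the mechanism.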

Now for the example you mentioned (i.e. ‘1000mg’ would become [‘1000’, ‘mg’]), you can probably use the Digits PreTokenizer, which does exactly this.
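
A short sketch of that suggestion, using only built-in components. Combining Digits with Whitespace through Sequence is an assumption about what a typical setup might look like, not something prescribed in this thread.

```python
from tokenizers.pre_tokenizers import Digits, Sequence, Whitespace

# individual_digits=False keeps consecutive digits together ("1000", not "1","0","0","0").
digits = Digits(individual_digits=False)
print(digits.pre_tokenize_str("1000mg"))
# [('1000', (0, 4)), ('mg', (4, 6))]

# Components can be chained: split on whitespace first, then separate digits.
combined = Sequence([Whitespace(), Digits(individual_digits=False)])
result = combined.pre_tokenize_str("Take 1000mg daily")
print(result)
# [('Take', (0, 4)), ('1000', (5, 9)), ('mg', (9, 11)), ('daily', (12, 17))]
```

The resulting object can be assigned to tokenizer.pre_tokenizer before training, and since it is built entirely from library components it can be saved with the tokenizer.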

If you didn’t get a chance to familiarize yourself with the Getting started part of our documentation, I think you will love it as it explains a bit more how to customize your tokenizer, and gives concrete examples.

reSearch2vec · October 21, 2020, 8:03am

Thanks Anthony! A lot of great info.

I didn’t know the tokenizers library had official documentation. It doesn’t seem to be listed on the GitHub or pip pages, and googling ‘huggingface tokenizers documentation’ just gives links to the transformers library instead. It doesn’t seem to be on the huggingface.co main page either.

Very much looking forward to reading it.

vitali · February 20, 2021, 5:08am

The “official documentation” link in anthony’s post (post 2) points to localhost. Could you update the link?

anthony · February 20, 2021, 6:21pm

Thanks for letting me know! Just fixed it.

shtoshni · March 30, 2021, 3:13pm

Hi,

I was able to create a Custom Pretokenizer based on the example linked above. But I’m not able to save the tokenizer due to the exception “Custom PreTokenizer cannot be serialized”. I’m wondering how to bypass this.
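
One workaround that is sometimes used, though not confirmed anywhere in this thread, is to swap in a serializable placeholder before saving and re-attach the custom component after loading. The sketch below assumes a toy WordLevel vocabulary and a hypothetical DigitSplitter class; both are only there to make the example self-contained.

```python
import re

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import PreTokenizer, Whitespace


class DigitSplitter:
    """Hypothetical custom pre-tokenizer used only for this illustration."""

    def _split(self, i, normalized):
        # Return one NormalizedString slice per digit / non-digit run.
        return [
            normalized[m.start():m.end()]
            for m in re.finditer(r"\d+|\D+", str(normalized))
        ]

    def pre_tokenize(self, pretok):
        pretok.split(self._split)


vocab = {"[UNK]": 0, "1000": 1, "mg": 2}  # toy vocabulary (assumption)
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = PreTokenizer.custom(DigitSplitter())

# Before saving: replace the custom component with a serializable placeholder.
tokenizer.pre_tokenizer = Whitespace()
serialized = tokenizer.to_str()  # tokenizer.save(path) could be used the same way

# After loading: re-attach the custom component by hand.
restored = Tokenizer.from_str(serialized)
restored.pre_tokenizer = PreTokenizer.custom(DigitSplitter())
print(restored.encode("1000mg").tokens)
```

The trade-off is that the saved file no longer records the custom logic, so the Python class has to be shipped and re-attached wherever the tokenizer is loaded.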

Davidg707 · March 7, 2023, 4:40am

Are there plans for this to become a documented part of the API? I notice that the CustomDecoder code no longer works (I believe the method name changed). It would be great to have a stable API for this stuff, although I get that it’s a pretty niche thing.
