Tokenizer post_processor help

I’m struggling to get the post_processor to work for tokenisation. Can anyone point me in the right direction?

Here’s a minimal example:

from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] [CLS] $0 [SEP] [CLS]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.cls_token_id), 
        ("[SEP]", tokenizer.sep_token_id)
    ],
)

text_string = "The cat sat on the mat."

tokens = tokenizer(text_string)

print(tokenizer.decode(tokens.input_ids))
# OUTPUT: [CLS] the cat sat on the mat. [SEP]

However, I would expect to see extra [CLS] tokens at the beginning and the end. Where am I going wrong?

Hi @smh36, I think a lot of people confuse the HF Transformers tokenizer API with the HF Tokenizers library (I did too at first :joy: ). HF Tokenizers is for training new vocabularies and building tokenizers, and it lets you design a custom tokenization pipeline out of Normalization, Pre-tokenization, Model, Post-processing, etc. The HF Transformers tokenizer API, in contrast, loads a pre-trained tokenizer from the Hub or from local files. In your snippet, assigning to `tokenizer.post_processor` only attaches a new Python attribute to the Transformers wrapper; the underlying tokenizer that actually does the encoding never sees it, which is why the output is unchanged. So if you want to redesign a pre-trained tokenizer, you should use HF Tokenizers. The following code should help you :hugs:

from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer.from_pretrained("bert-base-uncased") # use HF Tokenizers instead
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [CLS] [SEP]",  # $A is the input sequence; special tokens appear exactly where the template puts them
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")), 
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

tokens = tokenizer.encode('Hi there').tokens
print(tokens) 
# > ['[CLS]', 'hi', 'there', '[CLS]', '[SEP]']
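
If you then want to use this customized tokenizer through the usual Transformers API, you can wrap the `Tokenizer` object in a `PreTrainedTokenizerFast`: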


from transformers import PreTrainedTokenizerFast

_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # Register the special tokens manually
    # https://huggingface.co/course/chapter6/8?fw=pt
    unk_token='[UNK]',
    sep_token='[SEP]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    mask_token='[MASK]',
    model_max_length=128,  # same as the block size of the model
)

print(_tokenizer('Hi there').input_ids)
# [101, 7632, 2045, 101, 102]
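
Alternatively, if you’d rather keep the Transformers tokenizer you already loaded, you can set the post-processor on its Rust backend directly: fast tokenizers expose the underlying `tokenizers.Tokenizer` as `backend_tokenizer`. Here’s a minimal sketch of that approach (the template and expected output are my guess at what you were after):

from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Assign to the Rust backend, not to the Python wrapper
tokenizer.backend_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] [CLS] $A [SEP] [CLS]",
    special_tokens=[
        ("[CLS]", tokenizer.cls_token_id),
        ("[SEP]", tokenizer.sep_token_id),
    ],
)

print(tokenizer.decode(tokenizer("The cat sat on the mat.").input_ids))
# expected: [CLS] [CLS] the cat sat on the mat. [SEP] [CLS]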