Tokenizer post_processor help

I’m struggling to get the post_processor to work for tokenisation. Can anyone point me in the right direction?

Here’s a minimal example:

from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] [CLS] $0 [SEP] [CLS]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.cls_token_id), 
        ("[SEP]", tokenizer.sep_token_id)
    ],
)

text_string = "The cat sat on the mat."

tokens = tokenizer(text_string)

print(tokenizer.decode(tokens.input_ids))
# OUTPUT: [CLS] the cat sat on the mat. [SEP]

However, I would expect to see extra [CLS] tokens at the beginning and the end. Where am I going wrong?

Hi @smh36, I think a lot of people confuse the HF Transformers tokenizer API with the HF Tokenizers library (I did too at first :joy: ). HF Tokenizers is for training new vocabularies and building tokenizers, and it lets you design a custom tokenization pipeline out of Normalization, Pre-tokenization, Model, Post-processing, etc. The HF Transformers tokenizer API, in contrast, loads a pre-trained tokenizer from the Hub or from local files. In your snippet, assigning to `tokenizer.post_processor` only attaches a new Python attribute to the Transformers wrapper; the underlying tokenizer that actually does the encoding never sees it, which is why the output is unchanged. So if you want to redesign a pre-trained tokenizer, you should use HF Tokenizers. The following code should help you :hugs:

from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer.from_pretrained("bert-base-uncased") # use HF Tokenizers instead
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [CLS] [SEP]",  # $A is the input sequence; special tokens appear exactly where the template puts them
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")), 
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

tokens = tokenizer.encode('Hi there').tokens
print(tokens) 
# > ['[CLS]', 'hi', 'there', '[CLS]', '[SEP]']
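
If you then want to use this customized tokenizer through the usual Transformers API, you can wrap the `Tokenizer` object in a `PreTrainedTokenizerFast`: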


from transformers import PreTrainedTokenizerFast

_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # Register the special tokens manually
    # https://huggingface.co/course/chapter6/8?fw=pt
    unk_token='[UNK]',
    sep_token='[SEP]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    mask_token='[MASK]',
    model_max_length=128,  # same as the block size of the model
)

print(_tokenizer('Hi there').input_ids)
# [101, 7632, 2045, 101, 102]
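
Alternatively, if you’d rather keep the Transformers tokenizer you already loaded, you can set the post-processor on its Rust backend directly: fast tokenizers expose the underlying `tokenizers.Tokenizer` as `backend_tokenizer`. Here’s a minimal sketch of that approach (the template and expected output are my guess at what you were after):

from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Assign to the Rust backend, not to the Python wrapper
tokenizer.backend_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] [CLS] $A [SEP] [CLS]",
    special_tokens=[
        ("[CLS]", tokenizer.cls_token_id),
        ("[SEP]", tokenizer.sep_token_id),
    ],
)

print(tokenizer.decode(tokenizer("The cat sat on the mat.").input_ids))
# expected: [CLS] [CLS] the cat sat on the mat. [SEP] [CLS]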