I’m struggling to get the post_processor to work for tokenisation. Can anyone point me in the right direction?
Here’s the minimal example
from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing
tokenizer = AutoTokenizer.from_pretrained( 'bert-base-uncased' )
tokenizer.post_processor = TemplateProcessing(
single="[CLS] [CLS] $0 [SEP] [CLS]",
pair="[CLS] $A [SEP] $B:1 [SEP]:1",
special_tokens=[
("[CLS]", tokenizer.cls_token_id),
("[SEP]", tokenizer.sep_token_id)
],
)
text_string = "The cat sat on the mat."
tokens = tokenizer( text_string )
print( tokenizer.decode( tokens.token_ids ) )
# OUTPUT: [CLS] the cat sat on the mat. [SEP]
However, I would expect to see extra [CLS] flags at the beginning and the end. Where am I going wrong?