I trained a new BertWordPieceTokenizer from scratch, using the same code as the example given in the docs. Then I created a new TemplateProcessing object and assigned it as the tokenizer's post-processor in order to add the [CLS] and [SEP] tokens (also following the example code). However, when I encode sentences with the tokenizer, it doesn't perform any post-processing.
Code:
from tokenizers import BertWordPieceTokenizer
from tokenizers.processors import TemplateProcessing

corpus = "./corpus.txt"

tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=True,
    lowercase=True,
)

tokenizer.train(
    corpus,
    vocab_size=32000,
    min_frequency=2,
    show_progress=True,
    special_tokens=["[UNK]", "[CLS]", "[SEP]"],
    limit_alphabet=1000,
    wordpieces_prefix="##",
)

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
Output:
['hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?']
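To be explicit about what I expected: the `single="[CLS] $A [SEP]"` template should wrap the token sequence with the two special tokens. Here is a plain-Python sketch of that behavior (illustration only, not the tokenizers API — `apply_single_template` is a hypothetical helper I wrote for this post):

```python
# Hypothetical helper, NOT part of the tokenizers library: it mimics what the
# "single" template "[CLS] $A [SEP]" should do to a single-sequence encoding.
def apply_single_template(tokens):
    return ["[CLS]"] + tokens + ["[SEP]"]

# The tokens I actually got back from tokenizer.encode(...) above:
tokens = ['hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?']

print(apply_single_template(tokens))
# → ['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', '[SEP]']
```

So I expected `output.tokens` to start with `[CLS]` and end with `[SEP]`, but the post-processor I assigned seems to be ignored entirely.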