I trained a new BertWordPieceTokenizer from scratch, using the same code as the example given in the docs. Then I created a new TemplateProcessing object and assigned it as the tokenizer's post-processor in order to add the [CLS] and [SEP] tokens (also following the example code). However, when I encode sentences with the tokenizer, it doesn't perform any post-processing.
Code:
from tokenizers import BertWordPieceTokenizer
from tokenizers.processors import TemplateProcessing

corpus = "./corpus.txt"

tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=True,
    lowercase=True,
)

tokenizer.train(
    corpus,
    vocab_size=32000,
    min_frequency=2,
    show_progress=True,
    special_tokens=["[UNK]", "[CLS]", "[SEP]"],
    limit_alphabet=1000,
    wordpieces_prefix="##",
)

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
Output:
['hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?']
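To be explicit about what I expected: the `single="[CLS] $A [SEP]"` template should wrap the token sequence with the two special tokens. Here is a plain-Python sketch of that behavior (illustration only, not the tokenizers API — `apply_single_template` is a hypothetical helper I wrote for this post):

```python
# Hypothetical helper, NOT part of the tokenizers library: it mimics what the
# "single" template "[CLS] $A [SEP]" should do to a single-sequence encoding.
def apply_single_template(tokens):
    return ["[CLS]"] + tokens + ["[SEP]"]

# The tokens I actually got back from tokenizer.encode(...) above:
tokens = ['hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?']

print(apply_single_template(tokens))
# → ['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', '[SEP]']
```

So I expected `output.tokens` to start with `[CLS]` and end with `[SEP]`, but the post-processor I assigned seems to be ignored entirely.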