Hi,
Is there a way to ensure that my custom tokenizer pads and truncates my inputs?
This is my code:
# Tokenizer
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, processors, trainers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
spl_tokens = ["[UNK]", "[SEP]", "[MASK]", "[CLS]", "[PAD]"] # special tokens
trainer = trainers.WordPieceTrainer(special_tokens=spl_tokens)
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=False)
pre_tokenizer = pre_tokenizers.CharDelimiterSplit(" ")
tokenizer.pre_tokenizer = pre_tokenizer
tokenizer.decoder = decoders.WordPiece()
tokenizer.train([file_path], trainer) # training the tokenizer
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print("[CLS] id = ", cls_token_id, ", [SEP] id = ", sep_token_id)
tokenizer.post_processor = processors.TemplateProcessing(
single="[CLS]:0 $A:0 [SEP]:0",
special_tokens=[
("[CLS]", cls_token_id),
("[SEP]", sep_token_id),
],
)
tokenizer.save("tokenizer-trained.json")
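After training, this is roughly the behaviour I am after (just a sketch of what I expect to work; I am assuming enable_padding and enable_truncation are the right methods here, and the length of 512 is only a placeholder):

# Sketch: enable fixed-length padding and truncation on the trained tokenizer
pad_token_id = tokenizer.token_to_id("[PAD]")
tokenizer.enable_padding(pad_id=pad_token_id, pad_token="[PAD]", length=512)  # pad every encoding to 512 tokens
tokenizer.enable_truncation(max_length=512)  # cut off anything longer than 512 tokens

encoding = tokenizer.encode("a short example sentence")
print(len(encoding.ids))  # I would expect 512 here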
I can’t seem to get my tokenizer to behave like the one in this example: