Padding and truncation for custom tokenizer


Is there a way to ensure that my custom tokenizer pads and truncates my inputs?

This is my code:

# Tokenizer
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

spl_tokens = ["[UNK]", "[SEP]", "[MASK]", "[CLS]", "[PAD]"]  # special tokens
trainer = trainers.WordPieceTrainer(special_tokens=spl_tokens)

tokenizer.normalizer = normalizers.BertNormalizer(lowercase=False)

tokenizer.pre_tokenizer = pre_tokenizers.CharDelimiterSplit(" ")

tokenizer.decoder = decoders.WordPiece()

tokenizer.train([file_path], trainer)  # training the tokenizer

cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print("[CLS] id = ", cls_token_id, ", [SEP] id = ", sep_token_id)

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    special_tokens=[
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ],
)

I can’t seem to get my tokenizer to behave like the one in this example:

Also, I’ve tried wrapping it inside a BertTokenizerFast object and calling it in the following way:

new_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer, model_max_length=1024)

new_tokenizer.encode_plus(example_str, padding=True, truncation=True, add_special_tokens=True)

It still doesn’t seem to work.