Hi @smh36, I think a lot of people confuse the HF Transformers tokenizer API with HF Tokenizers (I did too at first). HF Tokenizers trains new vocabularies and tokenizers, and lets you design a custom tokenization pipeline out of Normalization, Pre-tokenization, Model, Post-processing, etc. In contrast, the HF Transformers tokenizer API loads a pre-trained tokenizer from the Hub or from local files. So clearly, if you want to re-design a pre-trained tokenizer, you should use HF Tokenizers. The following code should help you:
from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # load with HF Tokenizers, not Transformers
# Re-design the post-processing template however you like; here a second
# [CLS] is placed after the sentence just to show the template is customizable
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [CLS] [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
tokens = tokenizer.encode('Hi there').tokens
print(tokens)
# > ['[CLS]', 'hi', 'there', '[CLS]', '[SEP]']
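If you also need to encode sentence pairs, TemplateProcessing accepts a pair template too (the :1 suffix sets the token type ids of the second sentence). A minimal sketch reusing the same special-token ids:

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [CLS] [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
print(tokenizer.encode('Hi there', 'How are you').tokens)
# expected: ['[CLS]', 'hi', 'there', '[SEP]', 'how', 'are', 'you', '[SEP]']

The single-sentence template is unchanged, so everything below still works the same way.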
from transformers import PreTrainedTokenizerFast
_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # Specify the special tokens manually, see
    # https://huggingface.co/course/chapter6/8?fw=pt
    unk_token='[UNK]',
    sep_token='[SEP]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    mask_token='[MASK]',
    model_max_length=128,  # same as the block size of the model
)
print(_tokenizer('Hi there').input_ids)
# [101, 7632, 2045, 101, 102]
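Once it behaves the way you want, you can save the wrapped tokenizer with save_pretrained and load it back later; the directory name here is just a placeholder:

_tokenizer.save_pretrained('my-custom-bert-tokenizer')  # placeholder directory
reloaded = PreTrainedTokenizerFast.from_pretrained('my-custom-bert-tokenizer')
print(reloaded('Hi there').input_ids)
# should print the same ids: [101, 7632, 2045, 101, 102]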