RobertaTokenizerFast seems to be ignoring my Lowercase() normaliser. I’ve created a custom tokeniser as follows:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Sequence, Whitespace, Digits, Punctuation
from tokenizers.trainers import BpeTrainer
from tokenizers.processors import RobertaProcessing
from tokenizers.decoders import BPEDecoder

# lowercase first, then split on whitespace, digit runs and punctuation
tokenizer = Tokenizer(BPE(unk_token="<unk>", end_of_word_suffix="</w>"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Sequence([Whitespace(), Digits(individual_digits=False), Punctuation()])
trainer = BpeTrainer(
    vocab_size=3000,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
)
tokenizer.train(files, trainer=trainer)
# wrap each sequence as <s> ... </s>, RoBERTa-style
tokenizer.post_processor = RobertaProcessing(
    cls=("<s>", tokenizer.token_to_id("<s>")),
    sep=("</s>", tokenizer.token_to_id("</s>")),
)
tokenizer.decoder = BPEDecoder(suffix="</w>")
(I’m not 100% sure if the BPE suffix is required?)
I then save it as follows:
tokenizer.model.save("./models/roberta")
tokenizer.save("./models/roberta/config.json")
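For what it's worth, my understanding is that tokenizer.model.save() only writes vocab.json and merges.txt, while tokenizer.save() serialises the whole pipeline (normaliser included) into a single JSON file. A round-trip check on the tokenizers side, which I'd expect to keep the lowercasing, would be something like:
from tokenizers import Tokenizer

# reload the full pipeline JSON written by tokenizer.save()
reloaded = Tokenizer.from_file("./models/roberta/config.json")
print(reloaded.encode("AHU-01-SAT").tokens)  # I'd expect this to still come back lower-cased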
The reason I need a custom tokeniser is that my examples aren't whitespace-delimited, e.g.:
tokenizer.encode("AHU-01-SAT").tokens
['<s>', 'ahu', '-', '01', '-', 'sat', '</s>']
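As far as I can tell, the normaliser itself is wired in correctly; calling it directly should just lowercase the string:
print(tokenizer.normalizer.normalize_str("AHU-01-SAT"))  # expected: "ahu-01-sat"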
The following doesn’t return the correct tokens:
from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("./models/roberta", max_len=512)
tokenizer("AHU-01-SAT")
{'input_ids': [0, 40, 112, 40, 1], 'attention_mask': [1, 1, 1, 1, 1]}
It's missing the first and last tokens (plus it's not even replacing them with <unk>?).
If I manually apply the normalisation, I get the correct tokenisation:
tokenizer("ahu-01-sat")
{'input_ids': [0, 109, 40, 112, 40, 598, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
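Comparing the two outputs, ids 40 and 112 show up in both, so I assume the '-' and '01' pieces survive and the 'ahu' / 'sat' pieces are simply dropped. Something like this should confirm which tokens the ids actually map to:
print(tokenizer.convert_ids_to_tokens([0, 40, 112, 40, 1]))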
I tried AutoTokenizer and observed the same issue - am I doing something wrong?