Creating a custom tokenizer for Roberta

RobertaTokenizerFast seems to be ignoring my Lowercase() normaliser. I’ve created a custom tokeniser as follows:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Sequence, Whitespace, Digits, Punctuation
from tokenizers.trainers import BpeTrainer
from tokenizers.processors import RobertaProcessing
from tokenizers.decoders import BPEDecoder

tokenizer = Tokenizer(BPE(unk_token="<unk>", end_of_word_suffix="</w>"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Sequence([Whitespace(), Digits(individual_digits=False), Punctuation()])

trainer = BpeTrainer(
    vocab_size=3000,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"]
)
tokenizer.train(files, trainer)   # files: list of paths to the training text files

# add the Roberta-style <s> ... </s> wrapping after training, once the special token ids exist
tokenizer.post_processor = RobertaProcessing(
    cls=("<s>", tokenizer.token_to_id("<s>")),
    sep=("</s>", tokenizer.token_to_id("</s>")),
)
tokenizer.decoder = BPEDecoder(suffix="</w>")

(I’m not 100% sure if the BPE suffix is required?)
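One way I'd sanity-check the suffix handling (untested sketch, using the tokeniser trained above) is a quick round trip:

encoded = tokenizer.encode("AHU-01-SAT")
print(encoded.tokens)                  # lowercased subword pieces produced by the normaliser + pre-tokenizer
print(tokenizer.decode(encoded.ids))   # the BPEDecoder should strip any </w> suffixes when reassembling the text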

I then save as

tokenizer.model.save("./models/roberta")
tokenizer.save("./models/roberta/config.json") 

The reason I need a custom tokeniser is that my examples aren't whitespace-delimited, i.e.

tokenizer.encode("AHU-01-SAT").tokens

['<s>', 'ahu', '-', '01', '-', 'sat', '</s>']

The following doesn’t return the correct tokens:

from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("./models/roberta", model_max_length=512)
tokenizer("AHU-01-SAT")

{'input_ids': [0, 40, 112, 40, 1], 'attention_mask': [1, 1, 1, 1, 1]}

It’s missing the first and last tokens (plus it’s not even replacing them with <unk>?).

If I manually apply the normalisation, I get the correct tokenisation:

tokenizer("ahu-01-sat")

{'input_ids': [0, 109, 40, 112, 40, 598, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

I tried AutoTokenizer and observed the same issue - am I doing something wrong?

Ahh, figured it out, it should be

tokenizer.save("./models/roberta/tokenizer.json") 

not

tokenizer.save("./models/roberta/config.json")
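I believe the difference is that tokenizer.model.save() only writes the BPE model files (vocab.json and merges.txt), whereas tokenizer.save() writes the complete tokenizer.json, including the normaliser and pre-tokenizer - which is why the lowercasing was being dropped before. With tokenizer.json in place, a quick check (assuming the files are under ./models/roberta):

from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("./models/roberta", model_max_length=512)
print(tokenizer("AHU-01-SAT").tokens())   # ['<s>', 'ahu', '-', '01', '-', 'sat', '</s>'], as in the manual test above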

One question I still have though is what’s the difference between tokenizer.json and tokenizer_config.json - for example, if I use RobertaTokenizerFast.from_pretrained() I get the following:

Didn’t find file added_tokens.json. We won’t load it.
Didn’t find file special_tokens_map.json. We won’t load it.
Didn’t find file tokenizer_config.json. We won’t load it.
loading file vocab.json
loading file merges.txt
loading file tokenizer.json

And if I try to load this tokeniser with AutoTokenizer.from_pretrained(), it fails since there’s no config.json - I’m trying to train from scratch using the run_mlm.py script.

Is there a step by step guide to training a (custom) tokenizer, then a model (from scratch) using run_mlm.py?

@david-waterworth Have you found your answer? I am following the code here to train a new language model. I am getting config.json not found when trying to load the pretrained tokenizer. :( Very puzzled, not sure what I did wrong?

I added a config.json file manually in the same folder as the tokeniser files, i.e.

{
  "model_type": "roberta",
  "vocab_size": 52000,
  "type_vocab_size": 1
}

I’m not sure there’s a step-by-step guide, but in general if you use AutoTokenizer.from_pretrained() rather than RobertaTokenizerFast.from_pretrained() then you need a config.json containing at least the model_type so Auto knows which model family you’re using, plus any other parameters that differ from the RoBERTa defaults (such as the vocab size).
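For example (untested), once that config.json sits in the same folder as the tokeniser files:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./models/roberta")   # folder containing tokenizer.json and config.json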

You need to do the same thing for your model, i.e. in the mlm script it loads AutoConfig.from_pretrained(config_name) to initialise the actual model parameters used in training (number of heads, etc.). You need to create your own config.json containing the parameters from RobertaConfig so AutoConfig can load them (the best thing to do is start by copying the config.json for RoBERTa from the model hub, then modify as required).

Also note you can create a config file from a config class using save_pretrained, i.e. something like (I’ve not tested this)

config.save_pretrained("./EsperBERTo")   # save_pretrained expects a directory; it writes config.json inside it

should work, and seems to be missing from the example you’re following.
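A fuller sketch of what I mean (untested - the values are just the ones from the config.json above, and should match whatever you trained your tokeniser with):

from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52000,
    type_vocab_size=1,
)
config.save_pretrained("./EsperBERTo")   # writes ./EsperBERTo/config.json for AutoConfig / run_mlm.py to pick up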