Creating a custom tokenizer for Roberta

RobertaTokenizerFast seems to be ignoring my Lowercase() normaliser. I’ve created a custom tokeniser as follows:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Sequence, Whitespace, Digits, Punctuation
from tokenizers.trainers import BpeTrainer
from tokenizers.processors import RobertaProcessing
from tokenizers.decoders import BPEDecoder

tokenizer = Tokenizer(BPE(unk_token="<unk>", end_of_word_suffix="</w>"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Sequence([Whitespace(), Digits(individual_digits=False), Punctuation()])

trainer = BpeTrainer(
    vocab_size=3000,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"]
)
tokenizer.train(files, trainer)   # files: list of paths to the training text files

# add the Roberta-style <s> ... </s> wrapping after training, once the special token ids exist
tokenizer.post_processor = RobertaProcessing(
    cls=("<s>", tokenizer.token_to_id("<s>")),
    sep=("</s>", tokenizer.token_to_id("</s>")),
)
tokenizer.decoder = BPEDecoder(suffix="</w>")

(I’m not 100% sure if the BPE suffix is required?)
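One way I'd sanity-check the suffix handling (untested sketch, using the tokeniser trained above) is a quick round trip:

encoded = tokenizer.encode("AHU-01-SAT")
print(encoded.tokens)                  # lowercased subword pieces produced by the normaliser + pre-tokenizer
print(tokenizer.decode(encoded.ids))   # the BPEDecoder should strip any </w> suffixes when reassembling the text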

I then save as

tokenizer.model.save("./models/roberta")
tokenizer.save("./models/roberta/config.json") 

The reason I need a custom tokeniser is that my examples aren't whitespace-delimited, i.e.

tokenizer.encode("AHU-01-SAT").tokens

['<s>', 'ahu', '-', '01', '-', 'sat', '</s>']

The following doesn’t return the correct tokens:

from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("./models/roberta", model_max_length=512)
tokenizer("AHU-01-SAT")

{'input_ids': [0, 40, 112, 40, 1], 'attention_mask': [1, 1, 1, 1, 1]}

It’s missing the first and last tokens (plus it’s not even replacing them with <unk>?).

If I manually apply the normalisation, I get the correct tokenisation:

tokenizer("ahu-01-sat")

{'input_ids': [0, 109, 40, 112, 40, 598, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

I tried AutoTokenizer and observed the same issue - am I doing something wrong?

Ahh, figured it out, it should be

tokenizer.save("./models/roberta/tokenizer.json") 

not

tokenizer.save("./models/roberta/config.json")
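I believe the difference is that tokenizer.model.save() only writes the BPE model files (vocab.json and merges.txt), whereas tokenizer.save() writes the complete tokenizer.json, including the normaliser and pre-tokenizer - which is why the lowercasing was being dropped before. With tokenizer.json in place, a quick check (assuming the files are under ./models/roberta):

from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("./models/roberta", model_max_length=512)
print(tokenizer("AHU-01-SAT").tokens())   # ['<s>', 'ahu', '-', '01', '-', 'sat', '</s>'], as in the manual test above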

One question I still have though is what’s the difference between tokenizer.json and tokenizer_config.json - for example, if I use RobertaTokenizerFast.from_pretrained() I get the following:

Didn’t find file added_tokens.json. We won’t load it.
Didn’t find file special_tokens_map.json. We won’t load it.
Didn’t find file tokenizer_config.json. We won’t load it.
loading file vocab.json
loading file merges.txt
loading file tokenizer.json

And if I try to load this tokeniser with AutoTokenizer.from_pretrained(), it fails since there’s no config.json - I’m trying to train from scratch using the run_mlm.py script.

Is there a step by step guide to training a (custom) tokenizer, then a model (from scratch) using run_mlm.py?

@david-waterworth Have you found your answer? I am following the code here to train a new language model. I am getting config.json not found when trying to load the pretrained tokenizer. :( Very puzzled, not sure what I did wrong?

I added a config.json file manually in the same folder as the tokeniser files, i.e.

{
  "model_type": "roberta",
  "vocab_size": 52000,
  "type_vocab_size": 1
}

I’m not sure there’s a step-by-step guide, but in general if you use AutoTokenizer.from_pretrained() rather than RobertaTokenizerFast.from_pretrained() then you need a config.json containing at least the model_type so Auto knows which model family you’re using, plus any other parameters that differ from the RoBERTa defaults (such as the vocab size).
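For example (untested), once that config.json sits in the same folder as the tokeniser files:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./models/roberta")   # folder containing tokenizer.json and config.json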

You need to do the same thing for your model, i.e. in the mlm script it loads AutoConfig.from_pretrained(config_name) to initialise the actual model parameters used in training (number of heads, etc.). You need to create your own config.json containing the parameters from RobertaConfig so AutoConfig can load them (the best thing to do is start by copying the config.json for RoBERTa from the model hub, then modify as required).

Also note you can create a config file from a config class using save_pretrained, i.e. something like (I’ve not tested this)

config.save_pretrained("./EsperBERTo")   # save_pretrained expects a directory; it writes config.json inside it

should work, and seems to be missing from the example you’re following.
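A fuller sketch of what I mean (untested - the values are just the ones from the config.json above, and should match whatever you trained your tokeniser with):

from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52000,
    type_vocab_size=1,
)
config.save_pretrained("./EsperBERTo")   # writes ./EsperBERTo/config.json for AutoConfig / run_mlm.py to pick up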