I’m building a BPE tokenizer from scratch, and I’d like to add some tokens to its vocabulary that are never broken apart and never included in merges.
The second part of that (keeping the tokens out of the merges) is easy – the tokens simply don’t appear in the training text I give the tokenizer. But for the first part (never breaking my special token, e.g. [Y], into its [, Y, and ] components), I’m less sure of the correct process.
I had thought I could just add the tokens to the tokenizer using
add_tokens() before training, and when testing with the Tokenizers library directly, everything seems fine.
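For reference, here’s roughly the shape of what I’m doing at the Tokenizers level (a minimal sketch – the corpus, vocab size, and the [Y] token are stand-ins for my real setup):

```python
from tokenizers import Tokenizer, models, trainers

# Minimal stand-in for my real training run
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Register the atomic token BEFORE training; it never appears in the
# corpus, so it can never participate in a merge
tokenizer.add_tokens(["[Y]"])

trainer = trainers.BpeTrainer(vocab_size=300, special_tokens=["[UNK]"])
corpus = ["some sample training text"] * 5
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Tested directly like this, the added token stays intact
print(tokenizer.encode("text [Y] text").tokens)
```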
But when I saved that tokenizer out as json, loaded it via the
tokenizer_file param of a Transformers
RobertaTokenizerFast, and used it to train a BART model, I got unexpected results – the added tokens were sometimes broken apart into their components.
The problem stopped and tokenization behaved as expected when I switched to adding my tokens with
add_special_tokens() instead. But that’s not ideal, because I want my tokens to be part of my output, not stripped away by the handy
skip_special_tokens=True param in the tokenizer’s decode method.
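Here’s a small repro of that trade-off at the Tokenizers level (stand-in tokenizer again; Tokenizer.add_special_tokens is, I believe, the library-level analogue of what I’m doing in Transformers):

```python
from tokenizers import Tokenizer, models, trainers

tok = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(vocab_size=300, special_tokens=["[UNK]"])
tok.train_from_iterator(["some sample training text"] * 5, trainer=trainer)

# Added as a special token: guaranteed atomic...
tok.add_special_tokens(["[Y]"])

ids = tok.encode("text [Y] text").ids

# ...but decode with skip_special_tokens=True silently drops it
print(tok.decode(ids, skip_special_tokens=True))
print(tok.decode(ids, skip_special_tokens=False))
```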
It’s entirely possible that in my first attempt I wasn’t loading the saved tokenizer correctly into
RobertaTokenizerFast – it’s a little confusing what gets loaded from the json and what needs to be passed explicitly as a parameter. But after re-reading the docs more carefully, it seems that maybe only special tokens are guaranteed to be atomic? Is there any way to define a token that is atomic but not in the same class as control tokens like eos and pad?
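In case it helps, the behaviour I’m after looks like this – I did notice that AddedToken has flags like normalized and single_word, but I’m not sure whether that’s the intended mechanism, and I haven’t verified it survives the round-trip through RobertaTokenizerFast (a sketch, not something I’m confident in):

```python
from tokenizers import AddedToken, Tokenizer, models, trainers

tok = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(vocab_size=300, special_tokens=["[UNK]"])
tok.train_from_iterator(["some sample training text"] * 5, trainer=trainer)

# A regular (non-special) added token; normalized=False asks the library
# to match it against the raw, un-normalized input text
tok.add_tokens([AddedToken("[Y]", normalized=False)])

enc = tok.encode("text [Y] text")

# What I want: atomic in encoding, AND not stripped on decode
print(enc.tokens)
print(tok.decode(enc.ids, skip_special_tokens=True))
```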