The tokens you add with `add_tokens` are not added directly to the original vocabulary. Instead, they are kept in a separate vocabulary of added tokens, which is matched first, so whatever you define manually always takes priority.
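For instance (a minimal sketch; `bert-base-uncased` is only an illustrative choice, and `hellocommitted` a made-up word):

```python
from transformers import AutoTokenizer

# Any tokenizer works here; bert-base-uncased is just an example.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

tokenizer.add_tokens(['hellocommitted'])
print(tokenizer.get_added_vocab())
# {'hellocommitted': 30522} -- kept in its own table, with an id after the original vocabulary

print(tokenizer.tokenize('hellocommitted'))
# ['hellocommitted'] -- matched before the regular vocabulary gets a chance to split it
```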
As you noticed, if you write `##committed` literally in the input text, it will use your token, but without the `##` it won't match. This is simply because added tokens are matched literally, exactly as you added them.
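For instance (same illustrative setup as above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # illustrative choice
tokenizer.add_tokens(['##committed'])

print(tokenizer.tokenize('hello ##committed'))
# ['hello', '##committed'] -- the literal string '##committed' in the text matches your token

print(tokenizer.tokenize('hellocommitted'))
# no '##' in the text, so the added token is ignored and the
# regular vocabulary splits the word into subword pieces
```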
So, you should be able to achieve what you want by doing:
```python
tokenizer.add_tokens(['committed'])
tokenizer.tokenize('hellocommitted')
# ['hello', 'committed']
```