Add new tokens for subwords

Hey, I am trying to add subword tokens to bert-base-uncased as follows:

num = tokenizer_bert.add_tokens('##committed')

['hello', '##com', '##mit', '##ted']

It seems like the tokenizer is literally adding the hash characters, when what I want is to create a new subword token ##committed. I am doing this to deal with hashtags, and I'm thinking of initializing those new subwords' embeddings to those of their original words.

Any solutions / better ways to deal with hashtags, would be really appreciated!
Thanks –

I'm not sure if adding subwords directly is possible. You could try adding them as special tokens instead, so that only an ID is created for them rather than them being split. Pinging @anthony for more details.


The tokens you add with add_tokens are not added directly to the original vocabulary; instead they are part of a special vocabulary. They are handled first, so that what you define manually always has priority.
As you noticed, if you write ##committed in the input text it will use your token, but not without the ##. This is because added tokens are matched literally, exactly as you added them.

So, you should be able to achieve what you want by doing:

tokenizer.add_tokens([ 'committed' ])
# [ 'hello', 'committed' ]

Thanks for the help.

I tried both of your answers, but neither works:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_tokens([  'committed'  ])
>> ['hello', '##com', '##mit', '##ted']

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_special_tokens({'unk_token': 'committed' })
>> ['hello', '##com', '##mit', '##ted']

(Note: add_special_tokens failed with an assertion error, so I used unk_token as the key.)

Any other ideas?

Edit: What does work, though, is adding two new tokens, as follows:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
num = tokenizer.add_tokens(['helloxxx', 'committed'])
>> ['helloxxx', 'committed']

So I guess I could delete every word in the vocab and re-add it to achieve my goal, but I imagine there must be a better solution?

Any other ideas for solving this? Thanks!

Hi, silly question: why do you have num = tokenizer.add_tokens for your first and last attempts, but just tokenizer.add_tokens for the middle one?

add_tokens returns the number of tokens added. For the middle one I just copied the answers from @anthony and @valhalla, so I didn't include it.

OK, thanks for replying.

What happens if you do tokenizer.tokenize('hellocommitted')
immediately after doing tokenizer.tokenize('helloxxxcommitted')?

What is the result of tokenizer.tokenize('uncommitted')?

Clearly, it has managed to create the 'committed' token, so it seems very odd that it wouldn't use it except when it arrives with helloxxx. Is that really happening, or did your earlier attempt simply fail for some reason, i.e. when you did
tokenizer.add_tokens([ 'committed' ])

Maybe it doesn’t like the spaces, or maybe there was a system failure at the wrong moment.

I just noticed something else odd: your last example should have output 'helloxxx', '##committed' (if the 'committed' token has been added at all). Are you sure that's what it said?

On a different tack, what do you mean by "deal with hashtags"? Are you using Twitter data? What does your BERT model currently do with # inputs? Are you sure that wouldn't be fine? Do you have enough data to retrain the model to deal with the new 'committed' token? (see the answer to this post ). Could you preprocess all your input data so that the input # characters become something else that doesn't have a special meaning within transformers?

Thanks for your help!

Here are all outputs & commands you mentioned:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
num = tokenizer.add_tokens(['helloxxx', 'committed'])

tokenizer.tokenize('helloxxxcommitted')
>> ['helloxxx', 'committed']

tokenizer.tokenize('hellocommitted')
>> ['hello', '##com', '##mit', '##ted']

tokenizer.tokenize('uncommitted')
>> ['un', '##com', '##mit', '##ted']

The reason it does not work is that the tokenizer already has the word committed in its vocabulary, hence it does not add it, even though I'd like to add it as an additional subword.

By dealing with hashtags, I mean that I have words like #justnorthkoreathings, which get tokenized as ['#', 'just', '##nor', '##th', '##kor', '##ea', '##thing', '##s']. While I have a lot of confidence in BERT, I doubt this tokenization is a very good signal, because a) BERT is pretrained on Wikipedia articles and books, where hashtags aren't the norm, and b) I'd argue that attention works best the fewer pieces it needs to combine; here it needs to combine four pieces just to understand "north korea".

Would appreciate any tips, how to deal with this!

Hello again. I didn't quite believe the system would be doing what your data showed, but I've checked it myself and it really is! I don't think it should be working quite like that:
Following an added token, the next token is shown without the continuation ## and is treated as a complete word
(so, as you said, when 'helloxxx' is an added token, 'helloxxxcommitted' is tokenized as ['helloxxx', 'committed'] ).
The same effect happens following a '#' ('#justnorth' is tokenized as ['#', 'just', '##nor', '##th']).

However, that’s no help to you.

It doesn’t seem to be possible to add continuation tokens.
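(That said, one hack sometimes suggested, which I haven't verified myself, is to edit the tokenizer's vocab.txt directly: bert-base-uncased ships with [unusedNN] placeholder slots, and you can overwrite one with a new ## continuation piece before loading the tokenizer. A rough sketch of just the file edit, using a tiny stand-in vocab file rather than the real one:)

```python
# Sketch: swap an [unusedNN] vocab slot for a new continuation token.
# Assumes you have a local copy of the tokenizer's vocab.txt
# (here simulated with a tiny stand-in file).
import os
import tempfile

def add_wordpiece(vocab_path, new_token, placeholder="[unused0]"):
    """Replace a placeholder line with new_token, keeping its vocab id."""
    with open(vocab_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    idx = lines.index(placeholder)   # raises ValueError if slot is absent
    lines[idx] = new_token           # new token inherits the slot's id
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return idx

# Tiny stand-in vocab to demonstrate the edit:
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                  encoding="utf-8")
tmp.write("[PAD]\n[unused0]\n[CLS]\n[SEP]\nhello\n##com\n")
tmp.close()

slot = add_wordpiece(tmp.name, "##committed")
print(slot)  # 1 — the index of the replaced placeholder
os.unlink(tmp.name)
```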

If I were working on this, I would start by considering whether to drop all the hashtags completely. Do they actually add any information that isn’t also in the rest of the tweet?

Another possibility would be to replace all of them with a single-word string, such as ‘#hashtag’ .
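Both of those options are a couple of lines of regex preprocessing. A minimal sketch (the pattern and placeholder word are my assumptions, nothing BERT-specific):

```python
# Sketch: two quick preprocessing options for hashtags, stdlib only.
import re

HASHTAG = re.compile(r"#\w+")

def drop_hashtags(text):
    """Option 1: remove hashtags (and any space before them) entirely."""
    return re.sub(r"\s*#\w+", "", text).strip()

def mask_hashtags(text, placeholder="hashtag"):
    """Option 2: replace each hashtag with a single placeholder word."""
    return HASHTAG.sub(placeholder, text)

print(drop_hashtags("great trip #justnorthkoreathings"))  # 'great trip'
print(mask_hashtags("great trip #justnorthkoreathings"))  # 'great trip hashtag'
```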

If I decided the words in the hashtags were required, I would pre-process them before passing them to the tokenizer. One option would be to split them into their component words (for partial code for this see eg ). There would then be the issue of choosing among the possible splits, but I expect that choosing the split with the fewest subwords each time would be good enough.
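As a sketch of that splitting idea: a small dynamic program over a word list finds the split with the fewest pieces. The toy vocabulary below is my own assumption; in practice you'd use a real dictionary or the tokenizer's whole-word vocab.

```python
# Sketch: split a hashtag into known words, minimizing the piece count.
def split_hashtag(tag, words):
    tag = tag.lstrip("#").lower()
    n = len(tag)
    INF = float("inf")
    # best[i] = fewest pieces covering tag[:i]; back[i] = start of last piece
    best = [0] + [INF] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            if tag[j:i] in words and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                back[i] = j
    if best[n] == INF:
        return None  # no full split found
    pieces, i = [], n
    while i > 0:
        pieces.append(tag[back[i]:i])
        i = back[i]
    return pieces[::-1]

vocab = {"just", "north", "korea", "things", "thing", "s", "no"}
print(split_hashtag("#justnorthkoreathings", vocab))
# ['just', 'north', 'korea', 'things']
```

Note that it prefers the four-piece split over the five-piece one ending in 'thing', 's', which matches the fewest-subwords heuristic above.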

If you have a very small dataset, you might consider replacing the hashtags manually, with meaningful phrases.

I think you are probably correct in thinking that Bert will not have seen 'nor ##th ##kor ##ea' very often. On the other hand, it might not have seen 'north korea' often enough to develop a particularly useful representation.

If you have a really large dataset, you could train your own tokenizer, including the hashtags in the vocab, but that would be a last resort.

I am not an expert, but I’m not sure that a large number of tokens per word is actually a problem. After all, Bert is used to working with very long strings of tokens, making some kind of overall representation of a text, so it isn’t restricted to single word meanings. If the vocab is only about 30000, then there must be a large number of words that have to be represented by two or more tokens, so Bert must be quite good at dealing with these.