Hello again, I didn’t quite believe the system would be doing what your data showed, but I’ve checked it myself and it really is! I don’t think it should be working quite like that:
Following an added token, the next token is shown without the continuation prefix ## and is treated as the start of a new word.
(so, as you said, when ‘helloxxx’ is an added token, ‘helloxxxcommitted’ is tokenized as [‘helloxxx’, ‘committed’] )
The same effect happens following a ‘#’ (‘#justnorth’ is tokenized as [‘#’, ‘just’, ‘##nor’, ‘##th’]).
However, that’s no help to you.
It doesn’t seem to be possible to add continuation tokens (that is, subwords prefixed with ##) as added tokens.
If I were working on this, I would start by considering whether to drop all the hashtags completely. Do they actually add any information that isn’t also in the rest of the tweet?
Another possibility would be to replace all of them with a single-word string, such as ‘#hashtag’ .
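To make those two options concrete, here is a rough sketch of how I might drop or replace the hashtags before tokenizing. The regex and the function names drop_hashtags and replace_hashtags are my own assumptions, not anything from a library, and the pattern is only a simplification of what Twitter actually accepts as a hashtag.

```python
import re

# Assumption: a hashtag is '#' followed by letters, digits or underscores.
HASHTAG_RE = re.compile(r"#\w+")


def drop_hashtags(text: str) -> str:
    """Remove every hashtag and collapse the leftover whitespace."""
    return " ".join(HASHTAG_RE.sub(" ", text).split())


def replace_hashtags(text: str, placeholder: str = "#hashtag") -> str:
    """Replace every hashtag with the same placeholder string."""
    # A function is used as the replacement so that backslashes in the
    # placeholder are never interpreted as regex escapes.
    return HASHTAG_RE.sub(lambda match: placeholder, text)


tweet = "Talks resume today #northkorea #diplomacy"
print(drop_hashtags(tweet))
print(replace_hashtags(tweet))
```

Either function would be applied to each tweet before it goes to the tokenizer. Comparing results on both versions would be one way to check whether the hashtags carry any extra information.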
If I decided the words in the hashtags were required, I would pre-process them before passing them to the tokenizer. One option would be to split them into words. (For partial code for this, see e.g. https://www.techiedelight.com/word-break-problem/ .) There would then be the issue of choosing among the possible splits, but I expect that choosing the split with the fewest subwords each time would be good enough.
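As a minimal sketch of the splitting idea, the following uses a standard word-break dynamic programme and keeps the split with the fewest words. It is not the code from the linked page. The function name split_hashtag and the toy word list are my own assumptions, and I am reading "fewest subwords" as "fewest dictionary words", which may not match the number of tokens Bert ends up with.

```python
from typing import List, Optional, Set


def split_hashtag(tag: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split a hashtag into the fewest vocabulary words.

    Returns None when no sequence of vocabulary words covers the hashtag.
    """
    text = tag.lstrip("#").lower()
    n = len(text)
    # best[i] is the shortest word list covering text[:i], or None if
    # text[:i] cannot be covered at all.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is None or word not in vocab:
                continue
            candidate = prefix + [word]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


# Toy word list for illustration only; a real one would need to be much larger.
vocab = {"just", "north", "nor", "th", "korea", "kore", "a"}
print(split_hashtag("#justnorth", vocab))   # ['just', 'north']
print(split_hashtag("#northkorea", vocab))  # ['north', 'korea']
print(split_hashtag("#xyz", vocab))         # None
```

When the function returns None, I would probably fall back to leaving the hashtag as it is, or to the placeholder approach.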
If you have a very small dataset, you might consider replacing the hashtags manually, with meaningful phrases.
I think you are probably correct in thinking that Bert will not have seen ‘nor ##th ##kor ##ea’ very often. On the other hand, it might not have seen ‘north korea’ often enough to develop a particularly useful representation.
If you have a really large dataset, you could train your own tokenizer, including the hashtags in the vocab, but that would be a last resort.
I am not an expert, but I’m not sure that a large number of tokens per word is actually a problem. After all, Bert is used to working with very long strings of tokens, making some kind of overall representation of a text, so it isn’t restricted to single word meanings. If the vocab is only about 30000, then there must be a large number of words that have to be represented by two or more tokens, so Bert must be quite good at dealing with these.