Hello again, I didn't quite believe the system would be doing what your data showed, but I've checked it myself and it really is! I don't think it should be working quite like that:
Following an added token, the next token appears without the ## continuation prefix and is treated as a complete word.
(so, as you said, when "helloxxx" is an added token, "helloxxxcommitted" is tokenized as ["helloxxx", "committed"])
The same effect happens following a "#" ("#justnorth" is tokenized as ["#", "just", "##nor", "##th"]).
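For reference, here is roughly how I checked it with the Hugging Face transformers library. I'm assuming bert-base-uncased here, and the exact sub-word splits depend on the model's vocab:

```python
from transformers import BertTokenizer

# Assuming bert-base-uncased; the exact splits depend on the vocab.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Register "helloxxx" as an added (whole-word) token.
tokenizer.add_tokens(["helloxxx"])

# The text after the added token is tokenized as if it started a new word,
# so no ## continuation pieces appear at the boundary.
print(tokenizer.tokenize("helloxxxcommitted"))  # ['helloxxx', 'committed']

# The same thing happens after "#", which is a token of its own.
print(tokenizer.tokenize("#justnorth"))         # ['#', 'just', '##nor', '##th']
```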
However, that's no help to you.
It doesn't seem to be possible to add continuation (##) tokens.
If I were working on this, I would start by considering whether to drop all the hashtags completely. Do they actually add any information that isn't also in the rest of the tweet?
Another possibility would be to replace all of them with a single-word string, such as "#hashtag".
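Either of those could be a simple regex pre-processing step. A minimal sketch (the pattern and the "#hashtag" placeholder are just illustrative):

```python
import re

# Matches "#" followed by word characters; adjust the pattern to your data.
HASHTAG_RE = re.compile(r"#\w+")

def drop_hashtags(text: str) -> str:
    """Remove hashtags entirely."""
    return HASHTAG_RE.sub("", text).strip()

def replace_hashtags(text: str, placeholder: str = "#hashtag") -> str:
    """Replace every hashtag with a single placeholder string."""
    return HASHTAG_RE.sub(placeholder, text)

print(drop_hashtags("breaking news #justnorth #northkorea"))
# -> "breaking news"
print(replace_hashtags("breaking news #justnorth #northkorea"))
# -> "breaking news #hashtag #hashtag"
```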
If I decided the words in the hashtags were required, I would pre-process them before passing them to the tokenizer. One option would be to split them into words (for partial code for this, see e.g. https://www.techiedelight.com/word-break-problem/ ). There would then be the question of which of the possible splits to choose, but I expect that picking the one with the fewest subwords each time would be good enough.
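A rough sketch of that splitting step, preferring the split with the fewest words (the word list here is a toy one; you would load a real dictionary):

```python
from functools import lru_cache

# Toy word list for illustration; in practice load a real dictionary file.
WORDS = {"just", "north", "korea", "breaking", "news"}
MAX_WORD_LEN = max(len(w) for w in WORDS)

def split_hashtag(tag: str):
    """Split the body of a hashtag into dictionary words, preferring the
    split with the fewest words. Returns None if no split exists."""
    body = tag.lstrip("#").lower()

    @lru_cache(maxsize=None)
    def best(i):
        # Fewest-word split of body[i:], or None if impossible.
        if i == len(body):
            return []
        candidates = []
        for j in range(i + 1, min(i + MAX_WORD_LEN, len(body)) + 1):
            if body[i:j] in WORDS:
                rest = best(j)
                if rest is not None:
                    candidates.append([body[i:j]] + rest)
        return min(candidates, key=len) if candidates else None

    return best(0)

print(split_hashtag("#justnorth"))   # ['just', 'north']
print(split_hashtag("#northkorea"))  # ['north', 'korea']
```

Taking the minimum by length is what gives the fewest-subwords split; ties are broken arbitrarily, which I would expect to be good enough here.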
If you have a very small dataset, you might consider replacing the hashtags manually with meaningful phrases.
I think you are probably correct in thinking that Bert will not have seen "nor ##th ##kor ##ea" very often. On the other hand, it might not have seen "north korea" often enough to develop a particularly useful representation.
If you have a really large dataset, you could train your own tokenizer, including the hashtags in the vocab, but that would be a last resort.
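For completeness, the Hugging Face tokenizers library can train a WordPiece vocab on your own corpus; a rough sketch, where the file name and settings are just placeholders:

```python
from tokenizers import BertWordPieceTokenizer

# Rough sketch: train a WordPiece vocab on your own corpus.
# "tweets.txt", vocab_size and min_frequency are placeholders.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=["tweets.txt"], vocab_size=30000, min_frequency=2)
tokenizer.save_model("my-tokenizer")

print(tokenizer.encode("#justnorth").tokens)
```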
I am not an expert, but I'm not sure that a large number of tokens per word is actually a problem. After all, Bert is used to working with very long strings of tokens, making some kind of overall representation of a text, so it isn't restricted to single-word meanings. If the vocab is only about 30,000, then there must be a large number of words that have to be represented by two or more tokens, so Bert must be quite good at dealing with these.