For example:
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
>>> tokenizer.add_tokens(' ')
1
>>> tokenizer.encode('你好 世界', add_special_tokens=False)
[872, 1962, 21128, 686, 4518]
>>> tokenizer.encode(['你', '好', ' ', '世', '界'], is_split_into_words=True, add_special_tokens=False)
[872, 1962, 686, 4518]
Clearly, the blank token is ignored in the second call (with `is_split_into_words=True`), even though it is kept when encoding the raw string. But if I change it to a non-whitespace token like '[balabala]', it works.
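To illustrate the non-whitespace case, here is a minimal sketch of what I mean (the ID 21129 is simply what I would expect `add_tokens` to assign next, after 21128 above; treat both the return value and the IDs as illustrative, not verified output):
>>> tokenizer.add_tokens('[balabala]')  # hypothetical: appended after the blank token, so presumably ID 21129
1
>>> tokenizer.encode(['你', '好', '[balabala]', '世', '界'], is_split_into_words=True, add_special_tokens=False)
[872, 1962, 21129, 686, 4518]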
So what is the proper way to add a whitespace token to the tokenizer?