I am a bit confused about the different tokens (with and without the ## continuation prefix) in bert-base-chinese.
In [1]: from transformers import AutoTokenizer
In [2]: tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')
In [3]: tokenizer.encode('恋爱ing', add_special_tokens=False)
Out[3]: [2605, 4263, 10139]
In [4]: tokenizer.save_pretrained('tokenizer')
Out[4]:
('tokenizer/vocab.txt',
'tokenizer/special_tokens_map.json',
'tokenizer/added_tokens.json')
In [5]: !grep -n ing tokenizer/vocab.txt
8222:##ing
9108:##ting
9310:booking
9383:king
9427:##ling
9536:##ning
9663:shopping
9741:##king
9756:##ding
10062:ling
10070:wedding
10140:ing
...
In [6]: !grep -n 爱 tokenizer/vocab.txt
4264:爱
17321:##爱
This shows that 爱 is encoded as id 4263 (爱) rather than 17320 (##爱), i.e. for this Chinese pre-trained model the two-character word 恋爱 is split into separate characters, each mapped to its word-initial form. But then why do we still need the ##爱 token in the vocab?
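
For reference, here is the check I would run to confirm which surface forms those ids correspond to (a minimal sketch continuing the same tokenizer as above; the expected outputs in the comments are my reading of the vocab lines shown by grep, not verified output):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')

# Map the ids from Out[3] back to their tokens.
# Given that vocab.txt line numbers are 1-based while ids are 0-based
# (line 4264 -> id 4263, line 10140 -> id 10139), I would expect the
# plain word-initial forms here, with no ## prefix:
print(tokenizer.convert_ids_to_tokens([2605, 4263, 10139]))  # expecting ['恋', '爱', 'ing']

# The same surface tokens should come out of tokenize() directly:
print(tokenizer.tokenize('恋爱ing'))  # expecting ['恋', '爱', 'ing']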