I added 3 new items to the BertTokenizer vocabulary (2 emojis and a made-up word), and saved the new vocabulary. Then I instantiated a new BertTokenizer using the new vocabulary file and checked that the tokenizer understood the new words. That worked fine.
Then I ran "encode" to see the token encodings and verified that the new encodings were used.
Then I ran "decode" on the encoded tokens and did NOT get the original words back. The new items added to the vocabulary were NOT decoded; they came back as [UNK], even though "encode" produced the correct encodings.
The code below illustrates the problem:
# Importing transformers tokenizer
from transformers import BertTokenizer
# Adding 3 new words or symbols to tokenizer vocabulary: thumbs-up and down emojis and a made-up word
newvocab = ['👍', '👎', 'babalu']
print(newvocab)
# Get basic Bert Tokenizer from pretrained
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# quick check to make sure new vocabulary is not yet present in existing tokenizer
print(tokenizer.tokenize('babalu')) # ['baba', '##lu']
print(tokenizer.tokenize('👍')) # ['[UNK]']
print(tokenizer.tokenize('👎')) # ['[UNK]']
# Get base vocabulary (to add newvocab to)
bert_vocab = tokenizer.get_vocab()
# add newvocab to bert_vocab
print(f'bert_vocab before = {len(bert_vocab)}')
for i, k in enumerate(newvocab, len(bert_vocab)):
    print(f'new item {k} : {i}')
    bert_vocab[k] = i
print(f'bert_vocab after = {len(bert_vocab)}')
# The above lines print:
# bert_vocab before = 30522
# new item 👍 : 30522
# new item 👎 : 30523
# new item babalu : 30524
# bert_vocab after = 30525
# save new vocab file
with open('/tmp/newvocab.tmp', 'w', encoding='utf-8') as tmp_vocab_file:
    tmp_vocab_file.write('\n'.join(bert_vocab))
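# Sanity check (a sketch, not part of the original run; it assumes get_vocab()
# returns tokens in id order, so that line N of the saved file is the token
# with id N -- I believe that holds for a vocab loaded from vocab.txt):
with open('/tmp/newvocab.tmp', encoding='utf-8') as tmp_vocab_file:
    saved_vocab = tmp_vocab_file.read().split('\n')
print(saved_vocab[30522:30525])  # expecting ['👍', '👎', 'babalu']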
# Get new tokenizer using the new vocabulary file
new_bert = BertTokenizer.from_pretrained('bert-base-uncased', vocab_file='/tmp/newvocab.tmp')
# Does the new tokenizer understand the new items added to the vocabulary?
new_bert.tokenize('thumbs-up 👍, thumbs-down 👎, new word babalu.')
# This produces:
# ['thumbs', '-', 'up', '👍', ',', 'thumbs', '-', 'down', '👎', ',', 'new', 'word', 'babalu', '.']
# which shows that the new tokenizer does understand the new items added to the vocabulary
# Checking the encoding and decoding
# It seems that the ENCODING is using all the new vocabulary entries (i.e., mapping emojis to their encodings)
# But the DECODING is not mapping them back to their original representation.
tokens = new_bert.encode('thumbs-up 👍, thumbs-down 👎, new word babalu.',
                         add_special_tokens=True,
                         max_length=32,
                         truncation=True)
print(f'tokens after encode:\n{tokens}')
tokens_decoded = tokenizer.decode(tokens)
print(f'tokens after decoding it back:\n{tokens_decoded}')
# this prints:
# tokens after encode:
# [101, 16784, 1011, 2039, 30522, 1010, 16784, 1011, 2091, 30523, 1010, 2047, 2773, 30524, 1012, 102]
# tokens after decoding it back:
# [CLS] thumbs - up [UNK], thumbs - down [UNK], new word [UNK]. [SEP]
# It seems that the new items are being mapped to their correct encodings (30522, 30523, 30524),
# but are not being decoded back to their original representation.
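# Extra diagnostic (a sketch, output not shown here): map the ids straight back
# to tokens with the new tokenizer, to check whether its id -> token lookup
# knows about the new entries at all.
print(new_bert.convert_ids_to_tokens(tokens))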
Am I doing anything wrong here?
Thanks!