I added 3 new items to the BertTokenizer vocabulary (2 emojis and a made-up word), and saved the new vocabulary. Then I instantiated a new BertTokenizer using the new vocabulary file and checked that the tokenizer understood the new words. That worked fine.
Then I ran "encode" to see the token encodings and verified that the new encodings were used.
Then I ran "decode" on the encoded tokens and did NOT get the original words back. The new items added to the vocabulary were NOT decoded; they came back as [UNK], even though "encode" produced the correct encodings.
The code below illustrates the problem:
# Importing transformers tokenizer
from transformers import BertTokenizer
# Adding 3 new words or symbols to tokenizer vocabulary: thumbs-up and down emojis and a made-up word
newvocab = ['👍', '👎', 'babalu']
print(newvocab)
# Get basic Bert Tokenizer from pretrained
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# quick check to make sure new vocabulary is not yet present in existing tokenizer
print(tokenizer.tokenize('babalu')) # ['baba', '##lu']
print(tokenizer.tokenize('👍')) # ['[UNK]']
print(tokenizer.tokenize('👎')) # ['[UNK]']
# Get base vocabulary (to add newvocab to)
bert_vocab = tokenizer.get_vocab()
# add newvocab to bert_vocab
print(f'bert_vocab before = {len(bert_vocab)}')
for i, k in enumerate(newvocab, len(bert_vocab)):
    print(f'new item {k} : {i}')
    bert_vocab[k] = i
print(f'bert_vocab after = {len(bert_vocab)}')
# The above lines print:
# bert_vocab before = 30522
# new item 👍 : 30522
# new item 👎 : 30523
# new item babalu : 30524
# bert_vocab after = 30525
# save new vocab file
with open('/tmp/newvocab.tmp', 'w', encoding='utf-8') as tmp_vocab_file:
    tmp_vocab_file.write('\n'.join(bert_vocab))
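# Sanity check (a sketch, not part of the original run; it assumes get_vocab()
# returns tokens in id order, so that line N of the saved file is the token
# with id N -- I believe that holds for a vocab loaded from vocab.txt):
with open('/tmp/newvocab.tmp', encoding='utf-8') as tmp_vocab_file:
    saved_vocab = tmp_vocab_file.read().split('\n')
print(saved_vocab[30522:30525])  # expecting ['👍', '👎', 'babalu']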
# Get new tokenizer using the new vocabulary file
new_bert = BertTokenizer.from_pretrained('bert-base-uncased', vocab_file='/tmp/newvocab.tmp')
# Does the new tokenizer understand the new items added to the vocabulary?
new_bert.tokenize('thumbs-up 👍, thumbs-down 👎, new word babalu.')
# This produces:
# ['thumbs', '-', 'up', '👍', ',', 'thumbs', '-', 'down', '👎', ',', 'new', 'word', 'babalu', '.']
# which shows that the new tokenizer does understand the new items added to the vocabulary
# Checking the encoding and decoding
# It seems that the ENCODING is using all the new vocabulary entries (i.e., mapping emojis to their encodings)
# But the DECODING is not mapping them back to their original representation.
tokens = new_bert.encode('thumbs-up 👍, thumbs-down 👎, new word babalu.',
                         add_special_tokens=True,
                         max_length=32,
                         truncation=True)
print(f'tokens after encode:\n{tokens}')
tokens_decoded = tokenizer.decode(tokens)
print(f'tokens after decoding it back:\n{tokens_decoded}')
# this prints:
# tokens after encode:
# [101, 16784, 1011, 2039, 30522, 1010, 16784, 1011, 2091, 30523, 1010, 2047, 2773, 30524, 1012, 102]
# tokens after decoding it back:
# [CLS] thumbs - up [UNK], thumbs - down [UNK], new word [UNK]. [SEP]
# It seems that the new items are being mapped to their correct encodings (30522, 30523, 30524),
# but are not being decoded back to their original representation.
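# Extra diagnostic (a sketch, output not shown here): map the ids straight back
# to tokens with the new tokenizer, to check whether its id -> token lookup
# knows about the new entries at all.
print(new_bert.convert_ids_to_tokens(tokens))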
Am I doing anything wrong here?
Thanks!