I’m hitting what seems to be an odd limit on the number of characters a WordPiece tokenizer will process before returning [UNK]. I’m working on a project that feeds BERT long strings of generated characters, presented as a single long, ‘strange-looking’ word. Any word of 100 characters or fewer seems to work fine. At 101 characters and above, both tokenizer.encode(msg) and tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(msg) seem to give up and return [UNK]. I haven’t been able to track down a reason or a way around it, and I should be well within the 512-token limit. I welcome any advice or direction. Thanks!
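To take my training step and my generated data out of the picture, here is a minimal sketch that shows the same boundary using only the stock pretrained tokenizer (the plain runs of 'a' are just stand-ins I made up for my real generated strings):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# compare a 100-character 'word' against a 101-character one
for n in (100, 101):
    word = 'a' * n
    ids = tok.encode(word)
    print(n, tok.convert_ids_to_tokens(ids))

# 100 -> ['[CLS]', 'aaa', '##aa', ..., '##a', '[SEP]']
# 101 -> ['[CLS]', '[UNK]', '[SEP]']

Below is the fuller version from my project, followed by its output: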
from transformers import AutoTokenizer
from tokenizers import trainers

# WordPiece tokenizer
tok_ptBert = AutoTokenizer.from_pretrained("bert-base-uncased")
spec_tokens = ['[PAD]', '[UNK]', '[MASK]']
trainerX = trainers.WordPieceTrainer(
    vocab_size=20000,
    min_frequency=2,
    special_tokens=spec_tokens,
    show_progress=True,
)
# Train tokenizer with new data from tk_train_dataset
# (note: train_new_from_iterator returns a new tokenizer rather than
#  retraining in place; the return value is not captured here)
tok_ptBert.train_new_from_iterator(tk_train_dataset, 20000)
print('Vocab Size: ' + str(tok_ptBert.vocab_size))
# 100 character long string
msg_100chr_and_below = 'aaaaaaaaaaaaaa .... aaaaaaaaaaaaaaaa'
# 101 character long string
msg_101chr_and_above = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...'
# encode 100 char word
print('raw msg len: ' + str(len(msg_100chr_and_below)))
enc = tok_ptBert.encode(msg_100chr_and_below)
print(enc)
#print(tok_ptBert.tokenize(msg_100chr_and_below))
print(tok_ptBert.convert_ids_to_tokens(enc))
print('===============')
# encode 101(+) char word
print('raw msg len: ' + str(len(msg_101chr_and_above)))
enc = tok_ptBert.encode(msg_101chr_and_above)
print(enc)
#print(tok_ptBert.tokenize(msg_101chr_and_above))
print(tok_ptBert.convert_ids_to_tokens(enc))
Output:
Vocab Size: 30522
raw msg len: 100
[101, 13360, 11057, 11057, [...], 11057, 11057, 11057, 11057, 2050, 102]
['[CLS]', 'aaa', '##aa', '##aa', [...], '##aa', '##aa', '##aa', '##aa', '##a', '[SEP]']
===============
raw msg len: 101
[101, 100, 102]
['[CLS]', '[UNK]', '[SEP]']