Help with Tokenizer Word Length Limit

I’m hitting what seems to be an odd limit on the number of characters a WordPiece tokenizer will process before returning [UNK]. I’m working on a project that presents long strings of generated characters to BERT as a single long, ‘strange-looking’ word. Any word of 100 characters or fewer tokenizes fine. At 101 characters and above, both tokenizer.encode(msg) and tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(msg) seem to give up and return [UNK]. I haven’t been able to track down a reason or a way around it, and I should be well within the 512-token limit. I welcome any advice or direction. Thanks!

# WordPiece tokenizer
from transformers import AutoTokenizer
from tokenizers import trainers

tok_ptBert = AutoTokenizer.from_pretrained("bert-base-uncased")
spec_tokens = ['[PAD]', '[UNK]', '[MASK]']

trainerX = trainers.WordPieceTrainer(
    vocab_size=20000,
    min_frequency=2,
    special_tokens=spec_tokens,
    show_progress=True,
)

# Train tokenizer with new data from tk_train_dataset
tok_ptBert.train_new_from_iterator(tk_train_dataset, 20000)

print('Vocab Size: ' + str(tok_ptBert.vocab_size))

# 100 character long string (100 'a' characters)
msg_100chr_and_below = 'a' * 100

# 101 character long string (101 'a' characters)
msg_101chr_and_above = 'a' * 101

# encode 100 char word
print('raw msg len: ' + str(len(msg_100chr_and_below)))
enc = tok_ptBert.encode(msg_100chr_and_below)
print(enc)
#print(tok_ptBert.tokenize(msg_100chr_and_below))
print(tok_ptBert.convert_ids_to_tokens(enc))

print('===============')

# encode 101(+) char word
print('raw msg len: ' + str(len(msg_101chr_and_above)))
enc = tok_ptBert.encode(msg_101chr_and_above)
print(enc)
#print(tok_ptBert.tokenize(msg_101chr_and_above))
print(tok_ptBert.convert_ids_to_tokens(enc))


Output:

Vocab Size: 30522

raw msg len: 100
[101, 13360, 11057, 11057, [...], 11057, 11057, 11057, 11057, 2050, 102]
['[CLS]', 'aaa', '##aa', '##aa', [...], '##aa', '##aa', '##aa', '##aa', '##a', '[SEP]']
===============
raw msg len: 101
[101, 100, 102]
['[CLS]', '[UNK]', '[SEP]']

I had the same problem. The WordPiece tokenizer has a max_input_chars_per_word argument (see the WordPiece model docs in the tokenizers library); the default is set to 100 in the Rust implementation.

You can change this to a value larger than your max number of characters per ‘word’, e.g.:

>>> from transformers import AutoTokenizer
>>> tok_ptBert = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> tok_ptBert.tokenize('a' * 101)
['[UNK]']
>>> tok_ptBert._tokenizer.model.max_input_chars_per_word = 1000
>>> tok_ptBert.tokenize('a' * 101)
['aaa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa', '##aa']
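
If you are training your own WordPiece tokenizer anyway (as in your code above), another option is to set the limit up front when you build the tokenizer with the tokenizers library, instead of patching the private _tokenizer attribute afterwards. The sketch below is only a rough outline under that assumption: tk_train_dataset stands in for your iterable of training texts, the Whitespace pre-tokenizer and the 1000-character limit are arbitrary choices, and you may want different special tokens.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# WordPiece model with a per-word character limit above the default of 100
wp_model = models.WordPiece(unk_token='[UNK]', max_input_chars_per_word=1000)

tokenizer = Tokenizer(wp_model)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=20000,
    min_frequency=2,
    special_tokens=['[PAD]', '[UNK]', '[MASK]'],
)
tokenizer.train_from_iterator(tk_train_dataset, trainer)

# Wrap it so it behaves like any other transformers tokenizer
tok_custom = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token='[UNK]',
    pad_token='[PAD]',
    mask_token='[MASK]',
)
print(tok_custom.tokenize('a' * 101))

If you end up on the slow (pure-Python) BertTokenizer rather than the fast one, I believe the same knob is exposed as tok_ptBert.wordpiece_tokenizer.max_input_chars_per_word.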

Brilliant! Thanks mormart!