WordPiece issue - behaves like WordLevel

I'm not sure what I'm doing wrong, but for some reason WordPiece behaves like a WordLevel tokenizer: it never splits a word into subword pieces, it only matches whole words.

Note - I do not want to train WordPiece, I have a prebuilt dictionary which I want to use.

Here is a minimal script that reproduces the issue:

from tokenizers import Tokenizer, Regex
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Split

samples = ['abc','def','abcdef']

model = WordPiece({'abc': 10, 'def': 20, '<UNK>': 100}, unk_token='<UNK>', max_input_chars_per_word=9999)
tokenizer = Tokenizer(model)

tokenizer.pre_tokenizer = Split(Regex('.*'), behavior='merged_with_previous')

for s in samples:
    print('for input=', s)
    print('standalone pre-tokenizer:', tokenizer.pre_tokenizer.pre_tokenize_str(s))
    print('tokenizer output:', tokenizer.encode(s).tokens)

Which outputs:

for input= abc
standalone pre-tokenizer: [('abc', (0, 3))]
tokenizer output: ['abc']
for input= def
standalone pre-tokenizer: [('def', (0, 3))]
tokenizer output: ['def']
for input= abcdef
standalone pre-tokenizer: [('abcdef', (0, 6))]
tokenizer output: ['<UNK>']

I expect the last output to be:
tokenizer output: ['abc', 'def']
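One thing I noticed while experimenting (I may be misreading the model's semantics): WordPiece seems to look up non-initial pieces of a word with a continuation prefix, `##` by default. If I add a `##`-prefixed continuation entry to the vocab, the long word does get split. A minimal sketch, using the same vocab as above plus an assumed `##def` entry:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# Same vocab as in the repro, plus a '##'-prefixed continuation entry.
# '##def' is an entry I added for this experiment; it is not in my real
# dictionary, which only contains bare pieces like 'def'.
vocab = {'abc': 10, 'def': 20, '##def': 30, '<UNK>': 100}
model = WordPiece(vocab, unk_token='<UNK>', max_input_chars_per_word=9999)
tokenizer = Tokenizer(model)

# No pre-tokenizer: the whole string is passed to the model as one word.
print(tokenizer.encode('abcdef').tokens)  # prints ['abc', '##def']
```

So it looks like the model matches the longest bare-vocab prefix and then requires `##`-prefixed pieces for the rest of the word, falling back to `<UNK>` for the whole word otherwise. If that's right, is there a way to make it use bare entries (no `##`) for continuation pieces, given that my prebuilt dictionary has none?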

Any idea what I'm doing wrong?
Any help will be highly appreciated :slight_smile: