Tokenizer tends to choose added tokens first rather than tokens in the vocab

Hello, I am working on customizing a tokenizer.

The tokenizer.json structure looks like this.

{
  "version": ...,
  "truncation": ...,
  "padding": ...,
  "added_tokens": [
    {added_token1},
    {added_token2}
  ],
  "normalizer": ...,
  "pre_tokenizer": ...,
  "post_processor": ...,
  "model": {
    ..., ...,
    "vocab": {"a": 0, "b": 1, "c": 2, ...}
  }
}
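(For reference, both token tables can be inspected directly from a saved tokenizer.json; a minimal sketch, assuming the file is named tokenizer.json:)

import json

# Compare the two token tables in a saved tokenizer.json
# ("tokenizer.json" is just an example file name).
with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

added = {t["content"]: t["id"] for t in data["added_tokens"]}
vocab = data["model"]["vocab"]

print("added_tokens:", len(added))
print("model vocab :", len(vocab))
print("overlap     :", set(added) & set(vocab))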

So I built it up from scratch with a BPE model. Below is the code.

from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers


class SettingTokenizer:

    @staticmethod
    def set_tokenizer_and_trainer():
        # Empty BPE model; the vocab gets filled in by training.
        tokenizer = Tokenizer(models.BPE())
        tokenizer.normalizer = normalizers.Sequence(
            [normalizers.BertNormalizer(strip_accents=True), normalizers.Replace("\\r\\n", " ")]
        )
        tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
        tokenizer.decoder = decoders.ByteLevel()
        trainer = trainers.BpeTrainer(
            vocab_size=100000,
            min_frequency=10,
            initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
            special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
        )
        return tokenizer, trainer
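The actual training run then looks roughly like this (a sketch; the file paths are just placeholders):

# Day 1: train the empty tokenizer and save it.
tokenizer, trainer = SettingTokenizer.set_tokenizer_and_trainer()
tokenizer.train(["corpus_day1.txt"], trainer)   # placeholder path
tokenizer.save("tokenizer.json")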

Everything seemed fine at the first training step.
Say I trained a new, empty tokenizer on 10,000 training samples.
I got a vocab of size 300 (no added tokens yet).

The next day I trained on another 5,000 samples.
The new tokens were added to added_tokens.

The following image explains the logic I implemented.
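In case the image does not load, the idea is roughly the following (a sketch; file paths and the diff step are simplified): train a temporary tokenizer on the new data only, keep the tokens the existing tokenizer does not know yet, and add them with add_tokens.

from tokenizers import Tokenizer

# Day 2 (sketch): update the existing tokenizer with tokens learned from new data.
old_tokenizer = Tokenizer.from_file("tokenizer.json")

# Train a temporary tokenizer on the new data only.
new_tokenizer, trainer = SettingTokenizer.set_tokenizer_and_trainer()
new_tokenizer.train(["corpus_day2.txt"], trainer)   # placeholder path

# Keep only the tokens the existing tokenizer does not have yet.
old_vocab = old_tokenizer.get_vocab(with_added_tokens=True)
new_tokens = [t for t in new_tokenizer.get_vocab() if t not in old_vocab]

# These end up in added_tokens, not in model.vocab.
old_tokenizer.add_tokens(new_tokens)
old_tokenizer.save("tokenizer.json")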

The problem I am having now is that the loaded tokenizer seems to use tokens from added_tokens before tokens from the model's vocab.

For example, the word "request" is tokenized into "requ" and "est" (input_ids = [1336, 400]).
But the tokenizer could instead use the tokens "reques" and "t", or just "request".
The input_ids would then be [265, 88] or [266].

As you can see in the image above, the token "request" is already in the vocab ("requ" and "est" are in added_tokens).
Why does the tokenizer prefer added_tokens over the vocab?
How can I set up this tokenizer so that it uses vocab tokens first?
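One way to see where each token actually lives is to compare the model vocab with and without the added tokens (a sketch; it ignores the byte-level "Ġ" prefix for simplicity):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Vocab of the BPE model only vs. vocab including added_tokens.
model_vocab = tokenizer.get_vocab(with_added_tokens=False)
full_vocab = tokenizer.get_vocab(with_added_tokens=True)

only_added = set(full_vocab) - set(model_vocab)
print("request" in model_vocab)   # True in my case: learned by the BPE model
print("est" in only_added)        # True in my case: exists only as an added token

# Added tokens are matched before the BPE model runs,
# so "request" still gets split on the added token "est".
print(tokenizer.encode("request").tokens)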

Thx!! :slightly_smiling_face:

I found a trick.
Instead of just passing a plain list of new tokens to add_tokens, wrap them in the AddedToken class,
like this.

from tokenizers import AddedToken

at3 = AddedToken('est', single_word=True)

You can set a few arguments there; I set single_word to True.

If it is False, it tokenizes "West" and "Request" into ["w", "est"] and ["requ", "est"].
But if it is True, the added token only matches as a standalone word, so those words are not split on it.
This is not a real solution, but I am good with it.
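For reference, the whole call then looks roughly like this (a sketch; the token list is just my example):

from tokenizers import AddedToken, Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Wrap each new token in AddedToken so it only matches as a whole word.
new_tokens = ["requ", "est"]   # example tokens
tokenizer.add_tokens([AddedToken(t, single_word=True) for t in new_tokens])

# "request" is no longer split on the added token "est",
# so the model's own "request" entry can be used.
print(tokenizer.encode("request").tokens)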

GLTA! :stuck_out_tongue_winking_eye: