Hello, I am working on customizing a tokenizer.
The tokenizer.json structure looks like this:
{
  "version": ...,
  "truncation": ...,
  "padding": ...,
  "added_tokens": [
    {added_token1},
    {added_token2}
  ],
  "normalizer": ...,
  "pre_tokenizer": ...,
  "post_processor": ...,
  "model": {
    ...,
    "vocab": {"a": 0, "b": 1, "c": 2, ...}
  }
}
So I built it up from scratch using the BPE model. Below is the code:
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, decoders, trainers

class SettingTokenizer:
    @staticmethod
    def set_tokenizer_and_trainer():
        # Empty BPE model with byte-level pre-tokenization and decoding
        tokenizer = Tokenizer(models.BPE())
        tokenizer.normalizer = normalizers.Sequence(
            [normalizers.BertNormalizer(strip_accents=True), normalizers.Replace("\r\n", " ")]
        )
        tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
        tokenizer.decoder = decoders.ByteLevel()

        # Trainer: byte-level initial alphabet plus BERT-style special tokens
        trainer = trainers.BpeTrainer(
            vocab_size=100000,
            min_frequency=10,
            initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
            special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
        )
        return tokenizer, trainer
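For reference, this is roughly how I run the first training pass (the file name below is just a placeholder for my corpus):

tokenizer, trainer = SettingTokenizer.set_tokenizer_and_trainer()
tokenizer.train(files=["corpus_day1.txt"], trainer=trainer)  # placeholder path for the training data
tokenizer.save("tokenizer.json")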
Everything seemed fine at the first training step.
Let's say I trained a new, empty tokenizer on 10,000 training samples
and got a vocab of size 300 (no added tokens yet).
The next day I trained on another 5,000 training samples,
and the new tokens were added under added_tokens.
The following image explains the logic I implemented.
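In code, the day-2 update is essentially this (a simplified sketch; new_tokens is a placeholder for the tokens produced by the second training run):

tokenizer = Tokenizer.from_file("tokenizer.json")  # load the day-1 tokenizer
new_tokens = ["requ", "est"]                       # placeholder for the day-2 tokens
tokenizer.add_tokens(new_tokens)                   # these end up in added_tokens
tokenizer.save("tokenizer.json")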
The problem I am having now is that the loaded tokenizer seems to use the tokens in added_tokens before the tokens in vocab.
For example, the word “request” is tokenized into “requ” and “est” (input_ids = [1336, 400]).
But the tokenizer could instead use the tokens “reques” and “t”, or just “request”,
in which case the input_ids would be [265, 88] or [266].
As you can see in the image above, the token “request” is already in the vocab (“requ” and “est” are in added_tokens).
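A minimal snippet that reproduces what I see (the IDs are the ones from my tokenizer):

tokenizer = Tokenizer.from_file("tokenizer.json")
enc = tokenizer.encode("request")
print(enc.tokens)  # observed: ['requ', 'est']  -> the added_tokens win
print(enc.ids)     # observed: [1336, 400], but I expected [266] ('request' from vocab)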
Why does the tokenizer prefer added_tokens over vocab?
How can I make this tokenizer use the vocab tokens first?
Thx!!