I am trying to train a word-level tokenizer with a few special tokens. I want these special tokens to have a mapping to an id regardless of whether they appear in the tokenizer's training samples (this is the default behaviour for special tokens). I train the tokenizer like so:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import WordLevel
>>> from tokenizers.trainers import WordLevelTrainer
>>>
>>> train_samples = pd.Series([
... np.array(["token_1", "token_2", "token_3"]),
... np.array(["token_4", "token_5", "token_6"]),
... ])
>>>
>>> special_tokens = [
... "token_5", "token_6", "token_100"
... ]
>>> tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
>>> trainer = WordLevelTrainer(special_tokens=["[UNK]"] + special_tokens)
>>> tokenizer.train_from_iterator(train_samples, trainer)
>>> tokenizer.get_vocab()
{'token_2': 5,
'token_4': 7,
'token_3': 6,
'token_1': 4,
'token_100': 3,
'token_6': 9,
'token_5': 8,
'[UNK]': 0}
When I encode a test sample containing special, non-special and unseen strings, here is what I get:
>>> test_sample = [
... "token_1",
... "token_2_extra",
... "token_5_extra",
... "token_6_extra",
... ]
>>> output = tokenizer.encode(test_sample, is_pretokenized=True)
>>> print(f"Text: {test_sample}")
>>> print(f"Tokens: {output.tokens}")
>>> print(f"IDs: {output.ids}")
Text: ['token_1', 'token_2_extra', 'token_5_extra', 'token_6_extra']
Tokens: ['token_1', '[UNK]', 'token_5', '[UNK]', 'token_6', '[UNK]']
IDs: [4, 0, 8, 0, 9, 0]
For regular tokens with extra text appended, the whole token gets mapped to [UNK]. Special tokens, however, get pulled out of the pre-tokenized string, splitting it in two: the special token keeps its own id and the extra text gets mapped to [UNK].
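As far as I can tell, this happens because the special tokens are registered as added tokens and matched against the input before the WordLevel model ever sees it. One thing I considered (untested, and I'm not sure the underscore in token_5_extra even counts as a word boundary here) is passing AddedToken objects with single_word=True instead of plain strings:
>>> from tokenizers import AddedToken
>>> trainer = WordLevelTrainer(
...     special_tokens=[AddedToken("[UNK]")] + [
...         AddedToken(t, single_word=True) for t in special_tokens
...     ]
... )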
Is there a way to disable this behaviour? I would like encode to treat special and non-special tokens the same, simply mapping the whole pre-tokenized string to [UNK] if there is extra text.
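For now the only workaround I can think of is to skip encode for the id lookup and map each pre-tokenized string through token_to_id myself, falling back to the [UNK] id when the lookup fails. A rough sketch with the tokenizer trained above (to_id is just a throwaway helper):
>>> unk_id = tokenizer.token_to_id("[UNK]")
>>> def to_id(tok):
...     tid = tokenizer.token_to_id(tok)
...     return tid if tid is not None else unk_id
...
>>> [to_id(tok) for tok in test_sample]  # should give [4, 0, 0, 0] with the vocab above
But this throws away everything else encode does, so a built-in option would be much nicer.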