Word level tokenizer pulls special tokens out of pretokenized strings

I am trying to train a word level tokenizer with a few special tokens. I want these special tokens to map to an id regardless of whether they appear in the tokenizer's training samples (this is the default behaviour for special tokens). I train the tokenizer like so:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import WordLevel
>>> from tokenizers.trainers import WordLevelTrainer
>>>
>>> train_samples = pd.Series([
...     np.array(["token_1", "token_2", "token_3"]),
...     np.array(["token_4", "token_5", "token_6"]),
... ])
>>>
>>> special_tokens = [
...     "token_5", "token_6", "token_100"
... ]
>>> tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
>>> trainer = WordLevelTrainer(special_tokens=["[UNK]"] + special_tokens)
>>> tokenizer.train_from_iterator(train_samples, trainer)
>>> tokenizer.get_vocab()

{'token_2': 5,
 'token_4': 7,
 'token_3': 6,
 'token_1': 4,
 'token_100': 3,
 'token_6': 9,
 'token_5': 8,
 '[UNK]': 0}

When encoding a few test samples containing special, non-special and unseen strings, here is what I get:

>>> test_sample = [
...     "token_1",
...     "token_2_extra",
...     "token_5_extra",
...     "token_6_extra",
... ]
>>> output = tokenizer.encode(test_sample, is_pretokenized=True)
>>> print(f"Text: {test_sample}")
>>> print(f"Tokens: {output.tokens}")
>>> print(f"IDs: {output.ids}")

Text: ['token_1', 'token_2_extra', 'token_5_extra', 'token_6_extra']
Tokens: ['token_1', '[UNK]', 'token_5', '[UNK]', 'token_6', '[UNK]']
IDs: [4, 0, 8, 0, 9, 0]

For regular tokens with extra text appended, the whole string gets mapped to [UNK]. Special tokens, however, get pulled out of the pretokenized string, splitting it in two: the first part maps to the special token and the extra text maps to [UNK].
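As far as I can tell, the splitting happens because the trainer registers the special tokens as added tokens, and those are matched against the raw input string before any pre-tokenization or lookup in the WordLevel vocab. A quick check with the tokenizer trained above (the string "xxtoken_5yy" is just my own illustration):

>>> output = tokenizer.encode("xxtoken_5yy")
>>> print(output.tokens)  # expected: ['[UNK]', 'token_5', '[UNK]']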

Is there a way to disable this behaviour? I would like encode to treat special and non-special tokens the same, simply mapping the whole pretokenized string to [UNK] whenever there is extra text.


For example like this?

# After training the tokenizer
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()  # set directly on the tokenizers.Tokenizer from the question

I tried this and it didn’t work. In general I would like it to work for special tokens with any extra text added to them, i.e. when I try to encode a token that contains a special token as a substring, regardless of whether there is whitespace or not.

This behaviour already applies to non-special tokens, so I'm just wondering if there is a way to do the same for special tokens.


Hey gursi26! Interesting issue with the word-level tokenizer splitting special tokens. To make it treat special and non-special tokens the same (mapping modified strings to [UNK]), try excluding the special tokens from the WordLevelTrainer’s special_tokens list and adding them as regular tokens during training instead. This might stop the splitting behavior. Have you experimented with the trainer’s config or checked the Hugging Face docs for token-splitting options?
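A rough sketch of that idea, reusing the setup from the question; the extra pseudo-sample is my own workaround for guaranteeing the three strings a vocab entry now that they are no longer passed to the trainer as special tokens:

import numpy as np
import pandas as pd
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer

special_tokens = ["token_5", "token_6", "token_100"]

train_samples = pd.Series([
    np.array(["token_1", "token_2", "token_3"]),
    np.array(["token_4", "token_5", "token_6"]),
    # pseudo-sample so the "special" strings always end up in the vocab,
    # but as ordinary WordLevel entries rather than added/special tokens
    np.array(special_tokens),
])

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
trainer = WordLevelTrainer(special_tokens=["[UNK]"])  # only [UNK] is registered as special
tokenizer.train_from_iterator(train_samples, trainer)

output = tokenizer.encode(["token_1", "token_5_extra"], is_pretokenized=True)
print(output.tokens)  # expected: ['token_1', '[UNK]'] -- token_5 is no longer pulled out

The trade-off is that token_5, token_6 and token_100 become ordinary vocabulary words: they keep a guaranteed id and get the [UNK] behaviour asked for above, but they lose the special-token flag (for example, decode(..., skip_special_tokens=True) will no longer skip them).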
