I'm trying to recreate the BERTweet tokenizer for my own use with the tokenizers library. Here is my code so far (the training data comes from the Kaggle "Natural Language Processing with Disaster Tweets" competition):
import pandas as pd
import tokenizers
from tokenizers import normalizers, pre_tokenizers, processors, trainers, Tokenizer, models
from tokenizers.trainers import WordPieceTrainer
from tokenizers.normalizers import NFD, NFC, StripAccents, BertNormalizer, Replace, Lowercase
from tokenizers.pre_tokenizers import Whitespace, Punctuation, Split
from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing
tokenizer_df = pd.concat([full_df, test_df], axis=0) # concatenating the train and test sets so the tokenizer sees all the text
tokenizer = Tokenizer(
models.WordPiece(unk_token='[UNK]')
)
# Normalization
tokenizer.normalizer = normalizers.Sequence([
Replace(tokenizers.Regex(r"http\S+|www\.\S+"), "HTTPURL"),
Replace(tokenizers.Regex(r"@\w+"), "@USER"), # From the BERTweet Paper
NFD(),
StripAccents(),
Replace(r"\s+", " "), # Collapsing whitespace
])
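# Quick sanity check on the normalizer alone (the example string below is just something I made up):
print(tokenizer.normalizer.normalize_str("Fire update at https://t.co/abc from @someone"))
# expected: "Fire update at HTTPURL from @USER"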
# Pre-tokenization (provides the "legal cut points" that the sub-word encoder may merge inside, but never across)
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
Split(tokenizers.Regex(r'HTTPURL'), behavior="isolated"), # keep HTTPURL as a standalone piece (should not be split)
Split(tokenizers.Regex(r'@USER'), behavior='isolated'), # same for @USER
Split(tokenizers.Regex(r"#\w+"), behavior="isolated"), # isolate by hashtags
Punctuation("isolated"), # isolate by punctuation marks
Whitespace(), # Split based on spaces
])
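# To see where the cut points actually land before WordPiece gets involved, I also ran this
# pre-tokenizer check (again on a made-up string; the output is a list of (piece, offsets) pairs):
print(tokenizer.pre_tokenizer.pre_tokenize_str("@USER says HTTPURL #CAfire is spreading!"))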
special_tokens = ["[CLS]", "[PAD]", "[SEP]", "[MASK]", "[UNK]", 'HTTPURL', '@USER']
trainer = WordPieceTrainer(
vocab_size=8000,
special_tokens=special_tokens,
)
# Model Training
tokenizer.train_from_iterator(tokenizer_df['text'], trainer=trainer) # 10K training points
cls_token_id = tokenizer.token_to_id('[CLS]')
sep_token_id = tokenizer.token_to_id('[SEP]')
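# Double-checking that the extra special tokens made it into the trained vocab
# (token_to_id returns None for anything that isn't in the vocabulary):
print(tokenizer.token_to_id('@USER'), tokenizer.token_to_id('HTTPURL'))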
# Post-processing
tokenizer.post_processor = TemplateProcessing(
single="[CLS] $A [SEP]",
pair="[CLS] $A [SEP] $B:1 [SEP]:1",
special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
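# Quick check that the template is applied by the bare tokenizer (made-up sentence again);
# the token list should start with [CLS] and end with [SEP]:
print(tokenizer.encode("just testing").tokens)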
After training, I tested it on a bit of text:
from transformers import PreTrainedTokenizerFast
wrapped_tokenizer = PreTrainedTokenizerFast(
tokenizer_object=tokenizer, # the trained WordPiece tokenizer from above, untouched
unk_token="[UNK]",
cls_token="[CLS]",
sep_token="[SEP]",
pad_token="[PAD]",
mask_token="[MASK]",
additional_special_tokens=['HTTPURL', '@USER'],
)
text_input = "@RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire"
encoded_input = wrapped_tokenizer(text_input)
print(encoded_input)
print(wrapped_tokenizer.convert_ids_to_tokens(encoded_input['input_ids']))
The following is the output from the last print statement:
['[CLS]', '@', 'USER', 'Update', '=', '>', 'California', 'Hwy', '.', '20', 'closed', 'in', 'both', 'direct', '##ions', 'due', 'to', 'Lake', 'County', 'fire', '-', '#', 'CA', '##fire', '[SEP]']
Is it possible to make sure '@USER' stays intact? Why is it getting split?