It's not just the training that's a black box; for intermediate learners, every step forward is a struggle.
Taking the model accessed via Unsloth as an example, you can find the token with id 128006 (<|start_header_id|>) in the tokenizer.json file. One of its attributes is "special": true.
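To make the lookup concrete, here is a minimal sketch of inspecting the "added_tokens" section of a tokenizer.json with only the standard library. The inline dict below mimics the file's structure (the two entries shown follow the Llama-3 example above); with a real model directory you would json.load the actual file instead.

```python
import json

# Mimics the "added_tokens" section of a tokenizer.json file.
# With a real file you would instead do:
#   with open("tokenizer.json") as f:
#       tokenizer_json = json.load(f)
tokenizer_json = {
    "added_tokens": [
        {"id": 128000, "content": "<|begin_of_text|>", "special": True},
        {"id": 128006, "content": "<|start_header_id|>", "special": True},
    ]
}

# Collect every added token flagged as special, keyed by id.
special = {t["id"]: t["content"]
           for t in tokenizer_json["added_tokens"] if t["special"]}

print(special[128006])  # <|start_header_id|>
```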
There is a crude way to verify this: whether you run inference through llama-cpp or directly through transformers, set skip_special_tokens=False and print the output, and the special tokens will appear in the decoded text. This check is most useful when you are adding your own special tokens.
e.g. text_output = tokenizer.decode(_[0], skip_special_tokens=False), where _ holds the token ids returned by generation.
However, what I'm still not sure about is why this token doesn't appear in special_tokens_map.json. One explanation I've heard is that this file only lists the named special tokens (bos_token, eos_token, and so on) rather than every token flagged "special": true in tokenizer.json, but I'm not completely sure.
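For reference, a Llama-3-style special_tokens_map.json looks roughly like the sketch below. This is an illustrative assumption, not the exact file shipped with any particular release: the exact entries (and whether each value is a plain string or an object with content/lstrip/rstrip fields) vary between model versions.

```
{
  "bos_token": "<|begin_of_text|>",
  "eos_token": "<|eot_id|>"
}
```

Notice that under this reading, <|start_header_id|> would be absent simply because it is not assigned to any named role like bos_token or eos_token, even though it is marked special in tokenizer.json.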