How to determine if a token is special

For example, in the Llama 3.1 tokenizer, "<|start_header_id|>" is a special token but does not appear in all_special_tokens.


It seems more complicated than I thought; the pieces are more tangled up than I expected.

It’s not just the training that’s a black box; for intermediate learners, every step forward is a struggle.

Taking the model as accessed via Unsloth as an example, you can find the token with id 128006 (<|start_header_id|>) in the tokenizer.json file; one of its attributes is "special": true.
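That flag can be read directly from the file with the standard library, no transformers install needed. Here is a minimal sketch; the JSON below is a hypothetical two-entry excerpt of a tokenizer.json "added_tokens" list (the real Llama 3.1 file contains many more entries), though the ids shown do match the real Llama 3.1 vocabulary:

```python
import json

# Hypothetical excerpt of tokenizer.json; the real file has many more
# entries under "added_tokens", each carrying a "special" flag.
tokenizer_json = json.loads("""
{
  "added_tokens": [
    {"id": 128000, "content": "<|begin_of_text|>", "special": true},
    {"id": 128006, "content": "<|start_header_id|>", "special": true}
  ]
}
""")

def is_special(token: str) -> bool:
    """True if the token is flagged "special": true in added_tokens."""
    return any(
        entry["content"] == token and entry["special"]
        for entry in tokenizer_json["added_tokens"]
    )

print(is_special("<|start_header_id|>"))  # True
print(is_special("hello"))                # False
```

For the real file, you would replace the inline JSON with `json.load(open("tokenizer.json"))` from the model directory.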

There is a less elegant way to verify this: whether using llama-cpp or transformers directly, set skip_special_tokens=False during inference and print the output. This method is probably more useful when you are adding your own special tokens.
e.g. text_output = tokenizer.decode(outputs[0], skip_special_tokens=False)
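The idea behind that check can be sketched without downloading a model. The toy decode below is an assumed simplification of how skip_special_tokens behaves (the vocabulary and the id 1000 for "user" are made up; 128006/128007 mirror Llama 3.1's header markers):

```python
# Toy vocabulary: token id -> (text, is_special). Hypothetical mapping
# for illustration only; 128006 mirrors Llama 3.1's <|start_header_id|>.
vocab = {
    128006: ("<|start_header_id|>", True),
    1000:   ("user", False),
    128007: ("<|end_header_id|>", True),
}

def decode(ids, skip_special_tokens=True):
    """Mimic tokenizer.decode: drop ids flagged special unless asked to keep them."""
    out = []
    for i in ids:
        text, special = vocab[i]
        if special and skip_special_tokens:
            continue
        out.append(text)
    return "".join(out)

ids = [128006, 1000, 128007]
print(decode(ids))                             # "user"
print(decode(ids, skip_special_tokens=False))  # "<|start_header_id|>user<|end_header_id|>"
```

If a token survives decoding only when skip_special_tokens=False, the tokenizer is treating it as special, whatever all_special_tokens says.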

What I’m still not sure about, however, is why it doesn’t appear in special_tokens_map.json. One explanation I’ve heard is that this file only lists certain named special tokens (bos, eos, and so on), but I’m not completely sure.
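If that explanation is right, it would also explain the original question: all_special_tokens would be derived from the named slots in special_tokens_map.json rather than from the "special" flags in tokenizer.json. A simplified sketch of that assumption (the map contents below are hypothetical, mirroring what Llama 3.1's file is commonly reported to contain):

```python
# Hypothetical contents of special_tokens_map.json for Llama 3.1:
# only the named slots, no <|start_header_id|>.
special_tokens_map = {
    "bos_token": "<|begin_of_text|>",
    "eos_token": "<|eot_id|>",
}

# Named slots that (under this assumption) feed into all_special_tokens,
# plus any "additional_special_tokens" list if present.
named_slots = ["bos_token", "eos_token", "unk_token", "sep_token",
               "pad_token", "cls_token", "mask_token"]

all_special_tokens = [special_tokens_map[k] for k in named_slots
                      if k in special_tokens_map]
all_special_tokens += special_tokens_map.get("additional_special_tokens", [])

print(all_special_tokens)
print("<|start_header_id|>" in all_special_tokens)  # False
```

Under this reading, a token flagged "special": true only in tokenizer.json is still treated as special by the tokenizer itself, but never reaches all_special_tokens because it was never registered in one of the named slots.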
