It's not just the training that's a black box; for intermediate learners, every step forward is a struggle.
Taking the model accessed via Unsloth as an example, you can find the token with id 128006 (<|start_header_id|>) in the tokenizer.json file. One of its attributes is "special": true.
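To make the lookup concrete, here is a minimal sketch of inspecting the "added_tokens" section of a tokenizer.json with only the standard library. The inline dict below mimics the file's structure (the two entries shown follow the Llama-3 example above); with a real model directory you would json.load the actual file instead.

```python
import json

# Mimics the "added_tokens" section of a tokenizer.json file.
# With a real file you would instead do:
#   with open("tokenizer.json") as f:
#       tokenizer_json = json.load(f)
tokenizer_json = {
    "added_tokens": [
        {"id": 128000, "content": "<|begin_of_text|>", "special": True},
        {"id": 128006, "content": "<|start_header_id|>", "special": True},
    ]
}

# Collect every added token flagged as special, keyed by id.
special = {t["id"]: t["content"]
           for t in tokenizer_json["added_tokens"] if t["special"]}

print(special[128006])  # <|start_header_id|>
```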
There is a crude way to verify this: whether you run inference through llama-cpp or directly through transformers, set skip_special_tokens=False and print the output, and the special tokens will appear in the decoded text. This check is most useful when you are adding your own special tokens.
e.g. text_output = tokenizer.decode(_[0], skip_special_tokens=False), where _ holds the token ids returned by generation.
However, what I'm still not sure about is why this token doesn't appear in special_tokens_map.json. One explanation I've heard is that this file only lists the named special tokens (bos_token, eos_token, and so on) rather than every token flagged "special": true in tokenizer.json, but I'm not completely sure.
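For reference, a Llama-3-style special_tokens_map.json looks roughly like the sketch below. This is an illustrative assumption, not the exact file shipped with any particular release: the exact entries (and whether each value is a plain string or an object with content/lstrip/rstrip fields) vary between model versions.

```
{
  "bos_token": "<|begin_of_text|>",
  "eos_token": "<|eot_id|>"
}
```

Notice that under this reading, <|start_header_id|> would be absent simply because it is not assigned to any named role like bos_token or eos_token, even though it is marked special in tokenizer.json.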