Hi Hugging Face Community,
I have the following questions regarding special tokens:
- Why doesn't `tokenizer.all_special_tokens` include the `<image>` token? I'm using the LLaVA model, which has `<image>` as a special token (defined in `added_tokens_decoder` of tokenizer_config.json). The tokenizer does encode and decode it as a special token. However, when I load the tokenizer and call `tokenizer.all_special_tokens` or `tokenizer.additional_special_tokens`, the `<image>` token is not included.
- Where is the `<image>` token loaded? I looked into the `tokenizer.from_pretrained` function, but there doesn't seem to be a place that actually reads in the `added_tokens_decoder` field of the config file where this special token is defined.
- Where in the `tokenizer.decode` function is `<image>` handled as a special token? I tried to set a breakpoint inside it to find out how `<image>` gets skipped as a special token, but I seem to end up in a call loop between tokenization_utils_base.py and tokenization_utils_fast.py.
It would be really helpful if you could answer any of these questions. Thank you very much!