Hi Hugging Face Community,
I have the following questions regarding special tokens:
- Why doesn't `tokenizer.all_special_tokens` include the `<image>` token? I'm using the LLaVA model, which has `<image>` as a special token (defined in `added_tokens_decoder` of tokenizer_config.json). The tokenizer does encode and decode it as a special token. However, when I load the tokenizer and call `tokenizer.all_special_tokens` or `tokenizer.additional_special_tokens`, the `<image>` token is not included.
- Where is the `<image>` token loaded? I looked into the `tokenizer.from_pretrained` function, but there doesn't seem to be a place that actually reads in the `added_tokens_decoder` field of the config file where this special token is defined.
- Where in the `tokenizer.decode` function is `<image>` handled as a special token? I tried to set a breakpoint inside it to find out how `<image>` gets skipped as a special token, but I seem to end up in a call loop between tokenization_utils_base.py and tokenization_utils_fast.py.
It would be really helpful if you could answer any of these questions. Thank you very much!