This is not related to the chat template, but to how decoding works in this tokenizer and how it marks the start of words. Let's go step by step.
Your example was
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
chat = [{"role": "user", "content": "abc"},
        {"role": "assistant", "content": "ipsum"},
        {"role": "user", "content": "123"},
        {"role": "assistant", "content": "sit"}]
ids = tokenizer.apply_chat_template(chat)
print(tokenizer.decode(ids))
which returns <s> [INST] abc [/INST]ipsum</s> [INST] 123 [/INST]sit</s>
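As a side check (hedged; this just uses the tokenize=False option of apply_chat_template, nothing from the original example), you can also render the template without tokenizing and compare it with the decoded IDs above, to see which spaces come from the template itself and which only appear after decoding.
text = tokenizer.apply_chat_template(chat, tokenize=False)  # rendered template string, no tokenization
print(text)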
Now, let's not use the chat template and instead pass the raw string to the tokenizer (we don't add the beginning-of-sentence token, as that's added automatically because tokenizer.add_bos_token is True):
ids = tokenizer("[INST] abc [/INST]ipsum</s> [INST] 123 [/INST]sit</s>")["input_ids"]
print(tokenizer.decode(ids))
We get, once again, almost the same string, but with an extra space after the first </s>: <s> [INST] abc [/INST]ipsum</s>  [INST] 123 [/INST]sit</s>
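(As a hedged aside, not part of the original example: you can verify the flag directly, and add_special_tokens=False is the standard way to skip the automatic <s> if you ever need to.)
print(tokenizer.add_bos_token)  # True, so <s> is prepended automatically
ids_no_bos = tokenizer("[INST] abc [/INST]", add_special_tokens=False)["input_ids"]  # no leading <s>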
So what's going on? Within decode, we call convert_ids_to_tokens, so let's see what's happening in there.
tokenizer.convert_ids_to_tokens(ids)
will give
['<s>',
 '▁[',
 'INST',
 ']',
 '▁abc',
 '▁[',
 '/',
 'INST',
 ']',
 'ip',
 'sum',
 '</s>',
 '▁',
 '▁[',
 'INST',
 ']',
 '▁',
 ...
So that's where the extra space is coming from. OK, let's dive deeper! What does convert_ids_to_tokens do? This is the exact implementation (minus docstrings/typing):
def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
    tokens = []
    for index in ids:
        index = int(index)
        if skip_special_tokens and index in self.all_special_ids:
            continue
        tokens.append(self._tokenizer.id_to_token(index))
    return tokens
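As an aside (a hedged example, not from the original walkthrough): the skip_special_tokens branch above is the only filtering that happens, so setting the flag drops <s> and </s> (their IDs are in all_special_ids) while every ▁-prefixed piece, including the standalone ▁, stays in the output.
print(tokenizer.convert_ids_to_tokens(ids, skip_special_tokens=True))
# same list as before, just without '<s>' and '</s>'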
So let's replicate it
for index in ids:
    print(index, tokenizer._tokenizer.id_to_token(index))
This will yield the same as above:
1 <s>
733 ▁[
16289 INST
28793 ]
18641 ▁abc
733 ▁[
28748 /
16289 INST
28793 ]
508 ip
1801 sum
2 </s>
28705 ▁
733 ▁[
...
So the [ token is encoded to ID 733, which is decoded as ▁[. This actually makes sense. You can go to the tokenizer vocabulary and you will find that ID 733 is indeed ▁[: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/raw/main/tokenizer.json
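You can also check this programmatically with get_vocab, which returns the token-to-ID mapping (a small sketch; both IDs below are the ones we already saw in the listing above):
vocab = tokenizer.get_vocab()
print(vocab["▁["])  # 733, the word-start form of "["
print(vocab["▁"])   # 28705, the standalone piece that decodes to the extra space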
So finally, the main question is why there's a ▁. The ▁ is usually added to mark the start of a word.
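To see that convention in action, here is a small hedged sketch with the same tokenizer (the exact split can depend on the prefix-space handling, but the IDs above suggest this result): tokenizing the tag on its own gives a ▁-prefixed piece only at the word boundary.
pieces = tokenizer.convert_ids_to_tokens(
    tokenizer("[INST]", add_special_tokens=False)["input_ids"]
)
print(pieces)  # expected: ['▁[', 'INST', ']'], only the first piece carries the ▁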