Skew between Mistral prompt in docs vs. chat template

See this

For Mistral, the chat template inserts a space between <s> and [INST], whereas the documentation doesn’t show one.

See these docs vs this code:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

chat = [{"role": "user", "content": "abc"}, 
        {"role": "assistant", "content": "ipsum"}, 
        {"role": "user", "content": "123"}, 
        {"role": "assistant", "content": "sit"}]

ids = tokenizer.apply_chat_template(chat)
print(tokenizer.decode(ids))

Which will result in this:

<s> [INST] abc [/INST]ipsum</s> [INST] 123 [/INST]sit</s>

This is not related to the chat template itself, but to how decoding works in this tokenizer and how it handles the start of words. Let’s go step by step.

Your example was

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

chat = [{"role": "user", "content": "abc"}, 
        {"role": "assistant", "content": "ipsum"}, 
        {"role": "user", "content": "123"}, 
        {"role": "assistant", "content": "sit"}]

ids = tokenizer.apply_chat_template(chat)
print(tokenizer.decode(ids))

which returns <s> [INST] abc [/INST]ipsum</s> [INST] 123 [/INST]sit</s>

Now, let’s not use the chat template and instead pass the raw string to the tokenizer (we don’t add the beginning-of-sentence token ourselves, as it’s added automatically since tokenizer.add_bos_token is True):

ids = tokenizer("[INST] abc [/INST]ipsum</s> [INST] 123 [/INST]sit</s>")["input_ids"]
print(tokenizer.decode(ids))

We get, once again, the same string with the extra space: <s> [INST] abc [/INST]ipsum</s> [INST] 123 [/INST]sit</s>
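As a quick sanity check (a small sketch, reusing the chat and ids variables from above), the two paths really do decode to the same string:

# Sketch: both the chat-template ids and the raw-string ids decode to the same
# "<s> [INST] abc [/INST]ipsum</s> [INST] 123 [/INST]sit</s>" string shown above.
template_ids = tokenizer.apply_chat_template(chat)
print(tokenizer.decode(template_ids) == tokenizer.decode(ids))  # True, per the outputs above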

So what’s going on? Within decode, we call convert_ids_to_tokens, so let’s see what happens there:

tokenizer.convert_ids_to_tokens(ids)

will give

['<s>',
 '▁[',
 'INST',
 ']',
 '▁abc',
 '▁[',
 '/',
 'INST',
 ']',
 'ip',
 'sum',
 '</s>',
 '▁[',
 'INST',
 ']',
 '▁',
...

So that’s where the extra space is coming from. Ok, let’s dive deeper! What does convert_ids_to_tokens do? This is the exact implementation (minus docstrings/typing)

def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
    tokens = []
    for index in ids:
        index = int(index)
        # Optionally drop special tokens such as <s> and </s>
        if skip_special_tokens and index in self.all_special_ids:
            continue
        tokens.append(self._tokenizer.id_to_token(index))
    return tokens

So let’s replicate it

for index in ids:
  print(index, tokenizer._tokenizer.id_to_token(index))

which will yield the same as above:

1 <s>
733 ▁[
16289 INST
28793 ]
18641 ▁abc
733 ▁[
28748 /
16289 INST
28793 ]
508 ip
1801 sum
2 </s>
28705 ▁
733 ▁[
...

So the [ token is encoded to ID 733, which maps back to the token ▁[. This actually makes sense: if you go to the tokenizer vocabulary, you will find that ID 733 indeed is ▁[ https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/raw/main/tokenizer.json
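You can also check this programmatically instead of opening the JSON (a small sketch):

# Both directions of the vocabulary lookup, matching the listing above
print(tokenizer.convert_tokens_to_ids("▁["))  # 733
print(tokenizer.convert_ids_to_tokens(733))   # '▁['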

So finally, the main question is why there’s a ▁. The ▁ is the marker the tokenizer adds for the start of a word, and it is rendered as a space when decoding.
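Here is a minimal illustration (a sketch; the expected outputs follow from the listings above, and demo_ids is just an illustrative variable name):

# Encoding "[INST]" at the start of a text puts it behind a start-of-word marker,
# and decoding that marker back produces the space.
demo_ids = tokenizer("[INST]")["input_ids"]
print(tokenizer.convert_ids_to_tokens(demo_ids))  # ['<s>', '▁[', 'INST', ']']
print(tokenizer.decode(demo_ids))                 # '<s> [INST]', the ▁ becomes a space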


Thanks @osanseviero for being helpful as usual. I think my confusion comes from decoding the chat template, which adds the space, and comparing that to the documentation, which doesn’t have a space.

It turns out that encoding the documented example without the space results in the same token ids as encoding the chat template.
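In code, that comparison looks roughly like this (a sketch, reusing chat and tokenizer from above; per the observation it should print True):

template_ids = tokenizer.apply_chat_template(chat)
doc_ids = tokenizer("[INST] abc [/INST]ipsum</s> [INST] 123 [/INST]sit</s>")["input_ids"]
print(template_ids == doc_ids)  # True, per the observation above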

However, you should NOT take the decoded chat template and use that as your prompt template! I think this is very confusing and many people make mistakes here. The main principle is that you CANNOT do a round trip from a decoded example back to an encoded one.
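A small sketch of why the round trip fails, reusing template_ids from the previous snippet: re-encoding the decoded template does not reproduce the original ids, since for example the literal <s> in the decoded string is tokenized again on top of the BOS token the tokenizer adds automatically.

roundtrip_ids = tokenizer(tokenizer.decode(template_ids))["input_ids"]
print(template_ids[:2])               # [1, 733]: a single BOS, then '▁['
print(roundtrip_ids[:2])              # expected to start with a duplicated BOS: [1, 1]
print(roundtrip_ids == template_ids)  # expected: False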