Skew between mistral prompt in docs vs. chat template

hamel · December 22, 2023, 12:35am

For mistral, the chat template will apply a space between <s> and [INST], whereas the documentation doesn’t have this.

See these docs vs this code:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

chat = [{"role": "user", "content": "abc"}, 
        {"role": "assistant", "content": "ipsum"}, 
        {"role": "user", "content": "123"}, 
        {"role": "assistant", "content": "sit"}]

ids = tokenizer.apply_chat_template(chat)
print(tokenizer.decode(ids))

Which will result in this:

<s> [INST] abc [/INST]ipsum</s> [INST] 123 [/INST]sit</s>

osanseviero · December 27, 2023, 12:00pm

This is not related to the chat template, but how decoding works in this tokenizer and start of words. Let’s go step by step.

Your example was

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

chat = [{"role": "user", "content": "abc"}, 
        {"role": "assistant", "content": "ipsum"}, 
        {"role": "user", "content": "123"}, 
        {"role": "assistant", "content": "sit"}]

ids = tokenizer.apply_chat_template(chat)
print(tokenizer.decode(ids))

which returns <s> [INST] abc [/INST]ipsum</s> [INST] 123 [/INST]sit</s>

Now, let’s not use chat templates and pass the raw string to the tokenizer (we don’t add beginning of sentence token as that’s added automatically - tokenizer.add_bos_token is True)

ids = tokenizer("[INST] abc [/INST]ipsum</s> [INST] 123 [/INST]sit</s>")["input_ids"]
print(tokenizer.decode(ids))

We get, once again, the same string with extra space <s> [INST] abc [/INST]ipsum</s> [INST] 123 [/INST]sit</s>

So what’s going on? Within decode, we call convert_ids_to_tokens, so let’s see what’s going on there

tokenizer.convert_ids_to_tokens(ids)

will give

['<s>',
 '▁[',
 'INST',
 ']',
 '▁abc',
 '▁[',
 '/',
 'INST',
 ']',
 'ip',
 'sum',
 '</s>',
 '▁[',
 'INST',
 ']',
 '▁',
...

So that’s where the extra space is coming from. Ok, let’s dive deeper! What does convert_ids_to_tokens do? This is the exact implementation (minus docstrings/typing)

def convert_ids_to_tokens(
        self, ids, skip_special_tokens: False) 
        tokens = []
        for index in ids:
            index = int(index)
            if skip_special_tokens and index in self.all_special_ids:
                continue
            tokens.append(self._tokenizer.id_to_token(index))
        return tokens

So let’s replicate it

for index in ids:
  print(index, tokenizer._tokenizer.id_to_token(index))

Will yield same as above

1 <s>
733 ▁[
16289 INST
28793 ]
18641 ▁abc
733 ▁[
28748 /
16289 INST
28793 ]
508 ip
1801 sum
2 </s>
28705 ▁
733 ▁[
...

So the [ token is encoded to ID 733, which is decoded as _[. This actually makes sense. You can go to the tokenizer vocabulary and will find that the ID 733 indeed is _[ https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/raw/main/tokenizer.json

So finally, the main question is why there’s an _. The _ is usually added for start of the word.

hamel · December 27, 2023, 3:39pm

Thanks @osanseviero for being helpful as usual. I think my confusion comes from trying to decode the chat template which adds the space and compare that to the documentation which doesn’t have a space.

It turns out that encoding the documented example without the space results in the same token ids as encoding the chat template.

However, you should NOT try to take the decoded chat template and use that as your prompt template! I think this is very confusing and many people are making mistakes here. I think the main principle is you CANNOT do a round trip from a decoded example back to an encoded one.

Topic		Replies	Views
Tokenizer.apply_chat_template vs formatting_func Beginners	0	52	July 2, 2024
Modifying normalizer for pretrained tokenizers don't consistently work 🤗Tokenizers	2	102	June 12, 2024
Llama2 tokenizer duplicate ids Beginners	2	1160	April 21, 2024
Data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 6952 column 3 Models	1	687	July 4, 2024
Different Behaviors between Tokenizers for Question Answering 🤗Transformers	0	330	October 20, 2021

Skew between mistral prompt in docs vs. chat template

Related Topics