"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"
To the best of my knowledge, the segment below adds an eos_token to the end of every conversation turn:
{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}
This is my first time working with multi-turn conversation data, and I am wondering why an eos_token is added to the end of every turn. Wouldn’t training on data like this give the model a mistaken understanding that text can be generated even after the eos_token?
Or does this not matter during inference because the LLM programmatically cuts generation once the eos_token has been generated?
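In case it helps to see this concretely, here is a minimal sketch (assuming access to the gated meta-llama/Meta-Llama-3.1-8B-Instruct checkpoint; the example messages are made up) that renders the template for a toy multi-turn conversation so you can see where <|eot_id|> lands:

```python
from transformers import AutoTokenizer

# Gated checkpoint: requires accepting the Llama 3.1 license on the Hub.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    {"role": "user", "content": "And of Italy?"},
]

# tokenize=False returns the rendered prompt string instead of token ids,
# which makes the per-turn <|eot_id|> markers easy to inspect.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# Every turn ends with <|eot_id|>, and the string finishes with
# <|start_header_id|>assistant<|end_header_id|> ready for the model's reply.
```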
The models were trained with the EOS token positioned in this way to create the illusion of a multi-turn conversation. Therefore, sticking to the prompt template used during instruction fine-tuning will keep the model behaving as expected.
The EOS token effectively helps the model understand that whoever was speaking is now done.
In some of my own experiments I have found that omitting the EOS token from my query caused the model to attempt to complete the query, whereas adding the EOS token caused it to reply to the query.
As with all things, experiment and see what happens.
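To make that experiment easy to reproduce, here is a rough sketch (same gated instruct checkpoint as above; the prompts are purely illustrative) contrasting a bare prompt with a templated one that ends the user turn with <|eot_id|>:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Generation is cut programmatically as soon as an eos_token
    # (<|eot_id|> for the instruct model) is produced.
    out = model.generate(
        **inputs, max_new_tokens=64, eos_token_id=tokenizer.eos_token_id
    )
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )

# Bare text without an end-of-turn marker: the model tends to continue it.
print(generate("The quick brown fox"))

# Properly templated text ending the user turn with <|eot_id|>:
# the model replies instead of completing the sentence.
chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Finish this sentence: the quick brown fox"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(generate(chat))
```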
@swtb
It’s interesting that you noticed this difference in behaviour between omitting and adding the EOS token.
I’ve been processing what you’ve said, and your explanation that the eos_token of the Llama-3.1-8B-Instruct model serves a different purpose from that of the Llama-3.1-8B base model does make intuitive sense to me.
As you’ve suggested, it would seem that structuring multi-turn conversations in this way would induce the model to learn that the eos_token marks the end of a turn, rather than the end of a model generation.
I’ve begun thinking that this may be why the base model and the instruct model have different eos_token values by default. The base model has its eos_token set to <|end_of_text|>, while the instruct model has its eos_token set to <|eot_id|>.
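This is easy to verify directly from the two tokenizers (again assuming access to both gated checkpoints):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
instruct = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

print(base.eos_token)      # <|end_of_text|>
print(instruct.eos_token)  # <|eot_id|>
```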
Perhaps this was an intentional choice by Meta, so as to avoid “overwriting” the information the Llama model had learned about the <|end_of_text|> token? By setting a different eos_token and ensuring that the chat_template made use of <|eot_id|>, perhaps they were able to preserve what was previously learned about the <|end_of_text|> token while inducing the behavior they desired.
If this interpretation seems off in any way, please let me know!
@Chahnwoo I think you are on the right lines. If I wanted to train certain behaviour into a model using special tokens and markers, I would definitely use a new token rather than reusing an old one.
We definitely want to leverage the base model’s language understanding for the instruction tuning, almost as an additive task that builds on the deeper knowledge from pretraining.
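For what it’s worth, here is a hypothetical sketch of that idea in transformers (the token name <|my_marker|> is made up and not part of the Llama vocabulary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B"  # base checkpoint, also gated
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Register a brand-new special token instead of repurposing <|end_of_text|>.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|my_marker|>"]}
)

if num_added > 0:
    # Grow the embedding matrix to cover the new id; the pretrained rows,
    # including the one for <|end_of_text|>, are left untouched, and only
    # the new row needs to be learned during fine-tuning.
    model.resize_token_embeddings(len(tokenizer))
```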