Question about Llama fine-tuning dataset token strings

Sorry: no matter how many times I press the Reply button with Enter or the spacebar, the editing window where I can write a post does not appear, so I had no choice but to open a new topic.
In my previous post, I uploaded my code and asked how to fine-tune it well, and I modified it according to the answer.
Now I am experimenting with how to structure the dataset to get good results.
In my previous post I showed an example, and I received an answer saying that special characters should not be included when fine-tuning.
Here I have a small question.
GPT-3 recommended two special token strings to me, one to put where the body starts and one to put where the body ends.
However, in my previous post, the answer said not to use <> strings.
According to GPT-3's explanation, you need to input these start and end tokens so that the model can tell where the body of the novel data begins and ends.
Should I not use these tokens?
Can I just input the body without these strings?


Not limited to <>: special character strings that have no meaning in themselves should not be passed unless there is a specific intention to do so. Generally, the less noise in the data, the better. For chat or writing models, it is better not to pass non-conversational character strings. (This does not apply when <> has a real meaning in the text, such as in the emoticon >_<.)
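
As a concrete illustration of the "less noise" point, here is a minimal sketch of stripping markup-like strings from training text before fine-tuning. The regex and the sample text are my own assumptions, not something from this thread:

```python
import re

def clean_body(text: str) -> str:
    # Remove HTML/XML-style tags such as <BODY> or </BODY>, which carry
    # no meaning inside the novel text itself.
    text = re.sub(r"</?[A-Za-z][A-Za-z0-9_]*>", "", text)
    # Collapse the whitespace runs left behind by the removal.
    return re.sub(r"\s+", " ", text).strip()

# Ordinary uses of < and >, such as the emoticon >_<, are left alone.
print(clean_body("<BODY>It was a dark night. >_<</BODY>"))
# -> "It was a dark night. >_<"
```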

That said, chat templates and special tokens do matter when training a model, but inserting them should be left to the tokenizer, which handles it automatically. (This is the default behavior unless explicitly specified otherwise.)
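
For example, with the Transformers tokenizer you can see that the special tokens are added for you. This is a minimal sketch; the checkpoint name is just an example, and Llama-family repos on the Hub may require access approval:

```python
from transformers import AutoTokenizer

# Example checkpoint; substitute the model you are actually fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "Once upon a time, there was a small village by the sea."

# With the default add_special_tokens=True, the tokenizer prepends the
# BOS token (<s> for Llama 2) itself, so you do not type it into the dataset.
ids = tokenizer(text)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids)[:3])  # ['<s>', '▁Once', '▁upon']

# The raw special-token strings are available if you need to inspect them:
print(tokenizer.bos_token, tokenizer.eos_token)  # <s> </s>
```

The same applies to chat models: tokenizer.apply_chat_template inserts the model's own control tokens for you, so the dataset text itself can stay clean.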