Question about llama fine tuning dataset token string

John6666 · May 17, 2025, 6:38am

Not limited to <>, special characters that have no meaning in themselves should not be passed unless there is a specific intention to do so. Generally, it is better to have as little noise as possible in data. For chat or writing models, it is better not to pass non-conversational character strings. (This does not apply when <> has a specific meaning, such as the emoticon >_<.)

However, when training a model, Chat Templates or special tokens are important, but their assignment should be automated to some extent by the Tokenizer. (This is the default behavior unless explicitly specified.)

Topic		Replies	Views
SFT Trainer and chat templates Beginners	3	339	March 26, 2025
Issue with LLaMA-3 Fine-Tuning: Model Generates Correct Answer but Then Adds Unrelated Questions 🤗AutoTrain	5	300	April 8, 2025
Best practice for usage of Data Collator For CompletionOnlyLM in multi-turn chat 🤗Transformers	2	665	May 25, 2025
Fine Tuning with Alpaca vs Chat Template Beginners	0	510	December 12, 2024
Llama-2-7b-chat fine-tuning Models	4	6759	April 26, 2024

Question about llama fine tuning dataset token string

Related topics