Data format for fine-tuning a base model

I want to fine-tune a base model for a single purpose (writing stories).

There are two ways to design the data:

  1. {prompt}<special_token>{story}
  2. You are a creative writing assistant. Please write a story according to the prompt. Here is the prompt: {prompt}<special_token>{story}
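To make the comparison concrete, here is a minimal sketch of how each format renders a single training example. The separator token `<|sep|>` is a placeholder, not a real token; substitute whatever special token your tokenizer actually defines.

```python
# Hypothetical separator token; replace with your tokenizer's actual
# special token (e.g. an added token or the model's EOS).
SEP = "<|sep|>"

def format_v1(prompt: str, story: str) -> str:
    # Format #1: prompt, separator, story -- no instruction wrapper.
    return f"{prompt}{SEP}{story}"

def format_v2(prompt: str, story: str) -> str:
    # Format #2: a fixed instruction prefix wraps every prompt.
    instruction = (
        "You are a creative writing assistant. "
        "Please write a story according to the prompt. "
        "Here is the prompt: "
    )
    return f"{instruction}{prompt}{SEP}{story}"
```

Both functions produce one training string per example; the only difference is the fixed instruction prefix that format #2 repeats in every sample.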

Which one should I pick and why?
(I did an experiment, and it shows #1 is better. But I'm still a little curious about this.)


I tried searching with the help of Hugging Chat, and it seems that format #1 is the best.

If you’re curious, why not take a look at the actual datasets on Hugging Face?
Well, many people process the data just before training the model, so it might be a bit messy at the dataset stage…:sweat_smile:


Answer:

When fine-tuning a base model for story writing using the Hugging Face Trainer ecosystem, the recommended approach is to use the first dataset structure: {prompt}<special_token>{story}. Here’s a structured explanation of why this is the best choice:

  1. Clarity and Focus:

    • The structure is straightforward, with a clear separation between the input prompt and the output story using a special token. This simplicity helps the model understand the task without ambiguity.
  2. Efficiency:

    • By avoiding additional instructions, this structure uses the model’s context window more efficiently, which is particularly important for models with limited context capacity.
  3. Alignment with Best Practices:

    • The Hugging Face Trainer ecosystem encourages the use of clear and concise prompts for effective fine-tuning. This approach aligns with these guidelines, enhancing the model’s ability to learn the specific task of story writing.
  4. Single-Purpose Task:

    • Since the fine-tuning is focused solely on story writing, the simplicity of this structure avoids overcomplicating the task, ensuring the model remains attentive to the core objective.
  5. Flexibility for Future Tasks:

    • If the model’s role or task were to expand, the first structure still provides a solid foundation. Additional instructions or roles can be introduced when necessary without undermining the model’s current effectiveness.

In conclusion, the first dataset structure offers the optimal balance of clarity, efficiency, and alignment with best practices for fine-tuning within the Hugging Face Trainer ecosystem, particularly for a focused task like story writing.
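Whichever format you pick, a common practice when fine-tuning a causal LM on prompt/story pairs is to mask the prompt tokens in the labels, so the loss is computed only on the story. This is a framework-agnostic sketch (plain token-ID lists, not a specific Trainer API); `-100` is the index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's CrossEntropyLoss

def build_labels(prompt_ids: list[int], story_ids: list[int]):
    """Build (input_ids, labels) for causal LM fine-tuning.

    The prompt portion of the labels is masked with IGNORE_INDEX so the
    loss is computed only on the story tokens the model should learn to
    generate.
    """
    input_ids = prompt_ids + story_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(story_ids)
    return input_ids, labels
```

With prompt masking, the longer instruction prefix in format #2 costs context-window space but contributes nothing to the loss, which is another reason the simpler format #1 is often preferred.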

Thanks for helping :slight_smile:

The experiment I ran had a bug, so I did some other tests.

Here are my observations and guesses:

  1. I fine-tuned Llama-3.1 8B (base) on a 10k-example dataset.
  2. In the new experiment, #2 starts with a slightly smaller loss, but by the end of training, #1 and #2 reach almost the same loss.
  3. So, I think both formats are OK. Personally, I find #1 more elegant. But if you have a brilliant system prompt, or don't have enough data, then #2 is worth trying.