Dataset format expected by SFTTrainer

tjrlwjd1 · June 16, 2024, 2:35pm

As I mentioned in the title, I’m curious what the exact format of the dataset given to SFTTrainer should be.

I want to fine-tune an LLM on a code generation task.
Below is my ideal situation.

When the model is given the following:
{Natural Language Problem} (e.g. a coding competition problem),
I want the model to answer with code like the following:
{Solution Code}.

I made a dataset (Dataset class) with the columns (“prompt”, “completion”).
Then I looked at train_dataset after passing it to SFTTrainer:

<start token for tokenizer> <INST> {Natural Language Problem} </INST> {Solution Code}\ .

I am wondering, is this wrong?
If it is, please let me know how to build the dataset, especially the column names.
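
To make the layout above concrete, here is a minimal plain-Python sketch of how a (“prompt”, “completion”) record could be flattened into a single training string of that shape. The helper name `format_example` and the `<s>` / `</s>` start and end tokens are assumptions for illustration only; the `<INST>` tags are copied from the output shown above, and the real strings depend on the tokenizer and its chat template.

```python
# Hypothetical helper: the tag strings mirror the post, not any specific tokenizer.
def format_example(prompt: str, completion: str,
                   bos: str = "<s>", eos: str = "</s>") -> str:
    """Wrap the prompt in instruction tags and append the completion."""
    return f"{bos} <INST> {prompt} </INST> {completion} {eos}"


# One record per problem, using the same column names as in the post.
records = [
    {
        "prompt": "Write a function that returns the sum of two integers.",
        "completion": "def add(a, b):\n    return a + b",
    },
]

# Flatten each record into one text string.
texts = [format_example(r["prompt"], r["completion"]) for r in records]
print(texts[0])
```

A list of such records is also the shape one would typically pass to a dataset constructor, one dictionary per example with the same keys throughout.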

In addition, I tried to use DataCollatorForCompletionOnlyLM, but it didn’t work (the train loss is always 0).
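
For context, completion-only training generally works by masking the prompt tokens in the labels so that only the completion contributes to the loss. The following is a simplified plain-Python sketch of that idea, not the library's actual implementation; the function names and token ids are made up. It also shows one possible cause of a degenerate loss (an assumption, not confirmed for this case): if the response template's token ids never appear in the tokenized example, every label ends up masked.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_subsequence(seq, sub):
    """Return the start index of the first occurrence of sub in seq, or -1."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_prompt_labels(input_ids, response_template_ids):
    """Copy input_ids to labels, masking everything up to and including the template."""
    start = find_subsequence(input_ids, response_template_ids)
    if start == -1:
        # Template not found: nothing is left to train on.
        return [IGNORE_INDEX] * len(input_ids)
    end = start + len(response_template_ids)
    return [IGNORE_INDEX] * end + list(input_ids[end:])


# Made-up token ids: 1 = start token, 20 = closing instruction tag, 2 = end token.
input_ids = [1, 10, 11, 12, 20, 30, 31, 2]

ok_labels = mask_prompt_labels(input_ids, [20])   # template present
bad_labels = mask_prompt_labels(input_ids, [99])  # template missing

print(ok_labels)
print(bad_labels)
```

Comparing the token ids of the response template on its own against the token ids of the same string inside a full example is a quick way to check for this kind of mismatch, since some tokenizers encode a string differently depending on the surrounding context.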
