As mentioned in the title, I’m curious about the exact dataset format that SFTTrainer expects.
I want to fine-tune an LLM for a code generation task.
Below is my ideal setup.
When the model is given the following:
{Natural Language Problem (e.g. a coding competition problem)},
I want it to answer with code like the following:
{Solution Code}.
I built the dataset (using the Dataset class) with the columns (“prompt”, “completion”).
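To make the setup concrete, here is a minimal sketch of the kind of records I mean. The problem and solution text are made up for illustration, and `records` and `to_columns` are hypothetical names. With the `datasets` library, the resulting column dict could be passed to `Dataset.from_dict`; I leave that out here so the sketch has no dependencies.

```python
# Illustrative records only: the problem/solution text is invented for this sketch.
records = [
    {
        "prompt": "Given a list of integers, return the sum of the even ones.",
        "completion": "def sum_even(nums):\n    return sum(n for n in nums if n % 2 == 0)\n",
    },
    {
        "prompt": "Return True if a string is a palindrome, else False.",
        "completion": "def is_palindrome(s):\n    return s == s[::-1]\n",
    },
]


def to_columns(rows):
    """Turn a list of row dicts into a column dict with "prompt" and "completion" lists."""
    columns = {"prompt": [], "completion": []}
    for row in rows:
        # A row missing either key raises KeyError here, so bad rows surface early.
        columns["prompt"].append(row["prompt"])
        columns["completion"].append(row["completion"])
    return columns


columns = to_columns(records)
print(list(columns.keys()))
print(len(columns["prompt"]), len(columns["completion"]))
```

Each row pairs one problem statement with one solution, so the two columns always have the same length.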
Then I inspected train_dataset after passing the dataset to SFTTrainer, and each example looks like this:
<start token for tokenizer> <INST> {Natural Language Problem} </INST> {Solution Code}\ .
Is this format wrong?
If so, please let me know how to build the dataset, especially which column names to use.
In addition, I tried using DataCollatorForCompletionOnlyLM, but it didn’t work (the training loss is always 0).
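For context, this is my understanding of what completion-only masking is supposed to do, written as a library-free sketch rather than TRL's actual implementation. The names `find_subsequence` and `mask_prompt_labels` and all token ids are hypothetical. Labels up to the end of the response marker are set to -100 so that only the completion tokens contribute to the loss. If the marker is never found in the token ids, every label ends up ignored, which might be related to the degenerate loss I see, though I have not confirmed that.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of pattern in sequence, or -1."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_labels(input_ids, response_marker_ids):
    """Copy input_ids to labels, ignoring everything up to and including the marker."""
    labels = list(input_ids)
    start = find_subsequence(input_ids, response_marker_ids)
    if start == -1:
        # Marker not found: no token is left to train on.
        return [IGNORE_INDEX] * len(labels)
    completion_start = start + len(response_marker_ids)
    for i in range(completion_start):
        labels[i] = IGNORE_INDEX
    return labels


# Made-up token ids: 1 = start token, [7, 8] = response marker, 2 = end token.
input_ids = [1, 5, 6, 7, 8, 20, 21, 22, 2]

print(mask_prompt_labels(input_ids, [7, 8]))  # marker present
print(mask_prompt_labels(input_ids, [9, 9]))  # marker absent
```

In the second call the marker is absent, so no label survives. A marker that tokenizes differently inside the full text than it does on its own could produce the same result.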