I'm confused about when to transform a dataset with a model-specific chat template. Is it always advisable to apply the model's chat template to the data before training on it? Alpaca-style datasets, for instance, are often formatted with a function like the one below. That code applies a generic prompt format rather than a model-specific chat template, yet chat and instruct models are trained on very specific templates. Wouldn't training on a generic format like this just confuse the model? And yet the SFT tutorial recommends this exact function, so I feel like I'm getting mixed messages.
def formatting_prompts_func(examples):
    output_text = []
    for i in range(len(examples["instruction"])):
        instruction = examples["instruction"][i]
        input_text = examples["input"][i]
        response = examples["output"][i]
        # Use the "with input" prompt only when the input field is non-trivial.
        if len(input_text) >= 2:
            text = f'''Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_text}

### Response:
{response}
'''
        else:
            text = f'''Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}
'''
        output_text.append(text)
    return output_text
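
By contrast, here is roughly what I understand a model-specific template to mean, using the tokenizer's apply_chat_template. This is just a sketch of the alternative I have in mind (the model name is only an example, and I'm assuming its tokenizer ships a chat template and that folding instruction plus input into a single user turn is reasonable):

from transformers import AutoTokenizer

# Example only: any chat/instruct model whose tokenizer defines a chat template.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

def chat_template_formatting_func(examples):
    output_text = []
    for i in range(len(examples["instruction"])):
        instruction = examples["instruction"][i]
        input_text = examples["input"][i]
        response = examples["output"][i]
        # Fold the optional input into the user turn.
        if len(input_text) >= 2:
            user_content = f"{instruction}\n\n{input_text}"
        else:
            user_content = instruction
        messages = [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": response},
        ]
        # Render with the model's own special tokens instead of the generic Alpaca prompt.
        text = tokenizer.apply_chat_template(messages, tokenize=False)
        output_text.append(text)
    return output_text

Is this second approach what I should actually be doing when fine-tuning a chat/instruct model, or is the generic Alpaca format from the tutorial fine?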