How does SFTTrainer behave during evaluation?

I was wondering how SFTTrainer handles the instruction-formatted evaluation data that is passed to it.

Assume that we have a QA dataset and format it as shown below.

from trl import DataCollatorForCompletionOnlyLM

def formatting_prompts_func(example):
    # Build one prompt string per example in the batch.
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

# Everything up to (and including) the response template will be ignored by the loss.
response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
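
For completeness, this is roughly how I wire those pieces into SFTTrainer (a sketch based on the TRL examples; it assumes model, tokenizer, and the train/eval datasets are already loaded):

from trl import SFTTrainer

trainer = SFTTrainer(
    model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)
trainer.train()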

Thanks to the data collator, the loss will only be calculated for the tokens that appear after the “### Answer:” part of each example.
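
To make that concrete, here is an illustrative sketch (not TRL's actual implementation) of the masking I understand the collator to perform: it tokenizes the response template, looks for it in each example's input_ids, and sets every label before the answer to -100, the ignore index of PyTorch's cross-entropy loss.

def mask_prompt_tokens(input_ids, response_template_ids, ignore_index=-100):
    # input_ids and response_template_ids are plain Python lists of token ids.
    labels = list(input_ids)
    # Find where the tokenized response template starts in this example.
    start = None
    for i in range(len(input_ids) - len(response_template_ids) + 1):
        if input_ids[i:i + len(response_template_ids)] == response_template_ids:
            start = i + len(response_template_ids)
            break
    if start is None:
        # Template not found: ignore the whole example.
        return [ignore_index] * len(labels)
    # Tokens up to and including the template do not contribute to the loss.
    for i in range(start):
        labels[i] = ignore_index
    return labels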

In the training phase we use so-called teacher forcing (the ground-truth tokens are fed as input at every position, instead of the model's output from the previous time step), which helps the loss converge.
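
As a minimal sketch of what teacher forcing looks like with a Hugging Face causal LM (assuming model and tokenizer are already loaded; the example text is illustrative):

text = "### Question: What is 2 + 2?\n ### Answer: 4"
batch = tokenizer(text, return_tensors="pt")

# The ground-truth tokens serve both as input and (shifted internally by the
# model) as labels: at every position the model predicts the next ground-truth
# token, regardless of what it would have generated on its own.
outputs = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["input_ids"])
loss = outputs.loss  # teacher-forced next-token cross entropy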

But is this also the case in the evaluation phase? Do we use the ground truth as input instead of the model's generations from previous steps? I would guess yes, because otherwise the evaluation loss would be very volatile, and I do not observe that. It would mean we perform the same operations as on the training set, except that the model's parameters are not updated based on its performance on the evaluation set.
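
If that guess is correct, the evaluation step would look roughly like this sketch (reusing the batch from above; in practice the labels would be the collator-masked version rather than the raw input_ids): a single teacher-forced forward pass, no call to model.generate(), and no backward pass or optimizer update.

import torch

model.eval()
with torch.no_grad():
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["input_ids"])
eval_loss = outputs.loss  # same teacher-forced cross entropy as in training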
