I’m trying to fine-tune LLaMA, and I want to evaluate both the `eval_loss` and the `bleu` score during training, where the former needs teacher forcing while the latter does not.
I find that `Seq2SeqTrainer` executes both `model(**inputs).loss` to compute the evaluation loss and `model.generate(**inputs)` to compute the generated tokens. However, the `inputs` differ between the two calls: `model(**inputs).loss` requires `inputs` to include `labels`, while `model.generate(**inputs)` requires `inputs` not to include `labels`.
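For reference, this is roughly how I understand the two calls (a simplified sketch of the conflict, not the actual `Seq2SeqTrainer` source):

```python
# Simplified sketch; `inputs` holds input_ids, attention_mask, and labels
# after collation.

# Teacher-forced loss: `labels` must be present in `inputs`.
eval_loss = model(**inputs).loss

# Generation: `labels` must NOT be passed to generate().
gen_inputs = {k: v for k, v in inputs.items() if k != "labels"}
generated_tokens = model.generate(**gen_inputs)
```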
Take the sentence `I love you, do you` as an example. I train LLaMA with `I love you,` as context and `do you` as target. When evaluating, I want to inspect:

- `log[p(do | I love you,)] + log[p(you | I love you, do)]`, i.e. the `eval_loss`, which requires `I love you, do you` as `input_ids`;
- the generation results from `p(* | I love you,)` and the `bleu` score computed on them, which only requires `I love you,` as `input_ids`.
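Concretely, here is how I picture the two different inputs (a minimal sketch; the checkpoint name is just illustrative, and I'm glossing over the fact that the context may tokenize slightly differently on its own than as a prefix of the full sequence):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; substitute whatever LLaMA variant you use.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

context, target = "I love you, ", "do you"

# For eval_loss: the full sequence as input_ids, with the context positions
# set to -100 in labels so only the target tokens contribute to the loss.
full = tokenizer(context + target, return_tensors="pt")
labels = full["input_ids"].clone()
context_len = len(tokenizer(context)["input_ids"])
labels[:, :context_len] = -100

# For bleu: only the context as input_ids, so generate() continues
# from p(* | I love you,).
prompt = tokenizer(context, return_tensors="pt")
```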
How can I resolve this conflict so that both the `eval_loss` and the `bleu` score are evaluated in `Seq2SeqTrainer.prediction_step()`?
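What I have in mind is an override along these lines, but I'm not sure it's the intended approach (untested sketch; `prompt_len` is a hypothetical field I would have my data collator add, and I assume all prompts in a batch are padded to the same length):

```python
import torch
from transformers import Seq2SeqTrainer

class LossAndGenTrainer(Seq2SeqTrainer):
    """Rough idea: teacher-forced loss AND context-only generation per step."""

    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
        # Hypothetical field added by my collator: the shared prompt length.
        prompt_len = int(inputs.pop("prompt_len")[0])

        # 1) Teacher-forced pass with labels -> eval_loss.
        with torch.no_grad():
            loss = model(**inputs).loss

        # 2) Generation from the context only: drop labels, keep the prompt.
        #    (The returned sequences still include the prompt tokens, so
        #    compute_metrics would need to strip them before scoring bleu.)
        generated = model.generate(
            input_ids=inputs["input_ids"][:, :prompt_len],
            attention_mask=inputs["attention_mask"][:, :prompt_len],
            max_new_tokens=64,
        )
        return loss, generated, inputs["labels"]
```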