I’m trying to fine-tune LLaMA, and I want to evaluate both the `eval_loss` and the BLEU score during training, where the former needs teacher forcing while the latter does not.
I find that `Seq2SeqTrainer` simultaneously executes `model(**inputs).loss` to compute the evaluation loss and `model.generate(**inputs)` to compute the generated tokens. The `inputs` differ between the two calls: `model(**inputs).loss` requires the inputs to include the target tokens, while `model.generate(**inputs)` requires the inputs not to include them.
Take the sentence `I love you, do you` as an example. I train LLaMA with `I love you,` as context and `do you` as target. When evaluating, I want to inspect:

- log[p(do | I love you,)] + log[p(you | I love you, do)], i.e. the `eval_loss`, which requires `I love you, do you` as input;
- the generation results from p(* | I love you,), and the BLEU score computed on those generations, which only requires `I love you,` as input.
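To make the conflict concrete, here is a minimal sketch of the two input variants a decoder-only model would need for this example. This is my own illustration, not code from `Seq2SeqTrainer`: `build_eval_inputs` is a hypothetical helper and the token ids are made up; the only real convention used is that Hugging Face's loss ignores label positions set to `-100`.

```python
def build_eval_inputs(context_ids, target_ids):
    """Return (loss_inputs, gen_inputs) for one example.

    loss_inputs: the full sequence, with the context positions masked
    out of the labels (-100 is ignored by the cross-entropy loss), so
    the loss covers only the target tokens under teacher forcing.
    gen_inputs: the context only, so generation cannot see the target.
    """
    full_ids = context_ids + target_ids
    labels = [-100] * len(context_ids) + target_ids
    loss_inputs = {"input_ids": full_ids, "labels": labels}
    gen_inputs = {"input_ids": context_ids}
    return loss_inputs, gen_inputs

context = [306, 5360, 366, 29892]  # "I love you," (made-up ids)
target = [437, 366]                # "do you" (made-up ids)
loss_inputs, gen_inputs = build_eval_inputs(context, target)
```

So the loss call would see `I love you, do you` while the generate call would see only `I love you,` — the same batch cannot directly serve both.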
How can I resolve this conflict so as to evaluate both the `eval_loss` and the BLEU score in `Seq2SeqTrainer`?