Can I compute `eval_loss` and `bleu` score simultaneously for decoder-only transformers?

I’m trying to fine-tune LLaMA, and I want to evaluate both `eval_loss` and the BLEU score during training. The former requires teacher forcing, while the latter does not.

I see that `Seq2SeqTrainer` calls both `model(**inputs).loss` to compute the evaluation loss and `model.generate(**inputs)` to produce the generated tokens within the same evaluation step.

However, the two calls need different inputs: `model(**inputs).loss` requires `inputs` to include `labels`, while `model.generate(**inputs)` requires `inputs` not to include `labels`.

Take the sentence "I love you, do you" as an example. I train LLaMA with "I love you," as the context and "do you" as the target. During evaluation, I want to inspect:

  1. log[p(do | I love you,)] + log[p(you | I love you, do)], i.e. the target log-likelihood that `eval_loss` is based on, which requires "I love you, do you" as `input_ids`.
  2. the text generated from p(* | I love you,), from which the BLEU score is computed. This only requires "I love you," as `input_ids`.
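
To make the two input layouts concrete, here is a minimal sketch of what I think each call needs for one example. It uses plain Python lists and made-up token ids instead of a real tokenizer, and `build_eval_views` is a hypothetical helper of my own, not part of `transformers`. The only library convention it relies on is that label positions set to `-100` are ignored by the cross-entropy loss.

```python
# Label value skipped by PyTorch's CrossEntropyLoss by default (ignore_index=-100).
IGNORE_INDEX = -100


def build_eval_views(context_ids, target_ids):
    """Hypothetical helper: build the two input dicts for one example.

    Returns (loss_inputs, generation_inputs), where
    - loss_inputs holds context + target with the context masked in the labels,
      as needed for a teacher-forced forward pass;
    - generation_inputs holds the context only, as needed for free generation.
    """
    context_ids = list(context_ids)
    target_ids = list(target_ids)
    full_ids = context_ids + target_ids

    loss_inputs = {
        "input_ids": full_ids,
        "attention_mask": [1] * len(full_ids),
        # Only the target tokens should contribute to the loss.
        "labels": [IGNORE_INDEX] * len(context_ids) + target_ids,
    }
    generation_inputs = {
        "input_ids": context_ids,
        "attention_mask": [1] * len(context_ids),
    }
    return loss_inputs, generation_inputs


# Made-up token ids standing in for "I love you," and "do you".
context = [10, 11, 12, 13]
target = [20, 12]
loss_inputs, generation_inputs = build_eval_views(context, target)
```

In this layout, `loss_inputs` is what I would pass to `model(**inputs)` and `generation_inputs` is what I would pass to `model.generate(**inputs)`, after converting the lists to tensors and padding them per batch. The two dicts differ in both `input_ids` and the presence of `labels`, which is exactly the conflict described above.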

How can I resolve this conflict so that both `eval_loss` and the BLEU score are evaluated in `Seq2SeqTrainer.prediction_step()`?

