Trainer.evaluate() with text generation

Is there any update regarding this topic?
I would like to train a VisionEncoderDecoderModel for image captioning and measure BLEU during evaluation. The EvalPrediction object I get in compute_metrics contains only the logits, not the generated text or tokens (i.e. the result of a beam search). I would assume that computing metrics on the output of generate is not uncommon.
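To make the goal concrete, here is a rough sketch of the compute_metrics I have in mind. It assumes that predictions already holds generated token ids rather than logits, which as far as I understand is what Seq2SeqTrainer provides when Seq2SeqTrainingArguments is created with predict_with_generate=True; whether that applies cleanly to my setup is an assumption. The helper names (clean_ids, unigram_precision, make_compute_metrics) are made up for illustration, and the unigram precision is only a simple stand-in for a real BLEU implementation.

```python
from collections import Counter

IGNORE_INDEX = -100  # value used to mask label positions and pad gathered predictions


def clean_ids(batch, pad_token_id):
    """Replace IGNORE_INDEX with the pad id so the tokenizer can decode the rows."""
    return [
        [pad_token_id if int(t) == IGNORE_INDEX else int(t) for t in row]
        for row in batch
    ]


def unigram_precision(hypothesis, reference):
    """Clipped unigram precision; a toy stand-in for BLEU, not BLEU itself."""
    hyp_tokens = hypothesis.split()
    if not hyp_tokens:
        return 0.0
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hyp_tokens)
    overlap = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    return overlap / len(hyp_tokens)


def make_compute_metrics(tokenizer):
    """Build a compute_metrics callable bound to a tokenizer.

    Assumes the tokenizer exposes pad_token_id and batch_decode, and that
    eval_pred.predictions contains generated token ids (not logits).
    """

    def compute_metrics(eval_pred):
        pred_ids = clean_ids(eval_pred.predictions, tokenizer.pad_token_id)
        label_ids = clean_ids(eval_pred.label_ids, tokenizer.pad_token_id)
        hypotheses = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
        references = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
        scores = [unigram_precision(h, r) for h, r in zip(hypotheses, references)]
        return {"unigram_precision": sum(scores) / max(len(scores), 1)}

    return compute_metrics
```

With the plain Trainer this does not work as written, because predictions are logits there; taking an argmax over them would only give teacher-forced greedy tokens, not the beam search output.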

The PR mentioned in this thread seems to be stale, and there have been quite a few changes to Trainer since it was proposed.