Fine-tune Llama2 evaluation

Hello,
I want to fine-tune Llama2 for a specific text generation task, and I would like to get the model outputs for my evaluation dataset as text so that I can compute custom metrics. It looks like I can either use the compute_metrics function or call generate to get the outputs for my evaluation set. But I have a problem: when I decode the outputs, they are completely different between the two methods. Why?
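
To make the comparison concrete, here is a toy sketch of the two decoding paths. It assumes that the predictions reaching compute_metrics are the argmax of logits from a forward pass over the reference tokens (teacher forcing), whereas generate feeds the model its own previous outputs. It does not use the real model or library: `NEXT`, `toy_next_token`, `teacher_forced_predictions` and `greedy_generate` are hypothetical names made up for illustration.

```python
# Toy stand-in for a language model: the next token depends only on the
# previous token. Hypothetical; a real model conditions on the whole prefix.
NEXT = {
    "<s>": "the",
    "the": "cat",
    "cat": "sat",
    "dog": "ran",
    "sat": "</s>",
    "ran": "</s>",
}


def toy_next_token(prefix):
    """Return the toy model's most likely next token for a given prefix."""
    return NEXT.get(prefix[-1], "</s>")


def teacher_forced_predictions(reference):
    """Predict each position while seeing the *reference* prefix.

    This mimics taking the argmax of logits from an evaluation forward pass.
    """
    return [toy_next_token(reference[:i]) for i in range(1, len(reference))]


def greedy_generate(prompt, max_new_tokens=10):
    """Predict each position while seeing the model's *own* previous outputs."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)
        tokens.append(nxt)
        if nxt == "</s>":
            break
    return tokens[len(prompt):]


reference = ["<s>", "the", "dog", "ran", "</s>"]
forced = teacher_forced_predictions(reference)
generated = greedy_generate(["<s>"])

print("teacher-forced:", forced)
print("generated:     ", generated)
```

In this toy case the two sequences agree until the model's own choice ("cat") departs from the reference ("dog"). After that point the teacher-forced path keeps following the reference prefix while the generated path follows its own continuation, so the decoded texts no longer match.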