Perplexity on generated samples or directly on test set?

I have a fine-tuned causal language model to be used as a chatbot. Both the train and test sets consist of a prompt column and an answer column. During fine-tuning, the prompt and answer columns are concatenated as "prompt:answer" and then tokenized.
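Roughly how I build the training examples (a simplified sketch; the checkpoint name and sequence length are just placeholders for my actual setup):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; my actual model is a fine-tuned causal LM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def build_example(prompt: str, answer: str):
    # Concatenate prompt and answer into one sequence, "prompt:answer",
    # and tokenize it for causal-LM fine-tuning.
    text = f"{prompt}:{answer}" + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)
```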

For testing, I have generated response answers using only the prompt column of the test set. These answers are generated with different decoding strategies (greedy, beam search, sampling, top-k, etc.), roughly as in the sketch below.
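Simplified sketch of the generation step (the checkpoint name is a placeholder, and the generation arguments are just examples of the strategies I tried):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder name for my fine-tuned chatbot model.
model = AutoModelForCausalLM.from_pretrained("my-finetuned-chatbot")
tokenizer = AutoTokenizer.from_pretrained("my-finetuned-chatbot")

# A prompt from the test set, with the "prompt:" separator format used in training.
inputs = tokenizer("some test prompt:", return_tensors="pt")

with torch.no_grad():
    greedy = model.generate(**inputs, max_new_tokens=64)                           # greedy
    beam = model.generate(**inputs, max_new_tokens=64, num_beams=5)                # beam search
    topk = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50)   # top-k sampling

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```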

What would be the correct way to calculate the fine-tuned model's perplexity: directly on the test set, or on the samples generated from the test set prompts?

As I understand it, perplexity measures, via the cross-entropy loss, how well the model fits a dataset. So calculating it directly on the test set would be more accurate than calculating it on generated samples, which the model has in some sense already "seen"?
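Concretely, this is what I mean by computing perplexity directly on the test set (a sketch under my setup: score the reference "prompt:answer" sequences with the cross-entropy loss and exponentiate the mean per-token loss):

```python
import math
import torch

def test_set_perplexity(model, tokenizer, rows, device="cpu"):
    # rows: iterable of (prompt, answer) pairs from the test set.
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for prompt, answer in rows:
        enc = tokenizer(f"{prompt}:{answer}", return_tensors="pt").to(device)
        with torch.no_grad():
            # Passing labels=input_ids returns the mean cross-entropy
            # over the (shifted) predicted tokens.
            out = model(**enc, labels=enc["input_ids"])
        n_tokens = enc["input_ids"].size(1) - 1  # tokens actually predicted
        total_nll += out.loss.item() * n_tokens
        total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)
```

(If only the answer tokens should be scored, I suppose the prompt positions could be masked with `-100` in the labels, but that is part of what I'm unsure about.)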