How to evaluate the quality of a text-generating model

I am training a model on NPC phrases from a game, and I want to verify that it has learned the underlying patterns of the characters' speech.
But I don't know which metrics to use. ROUGE and BLEU seem to reward overfitting: a model that just repeats the original text gets the highest score. On the other hand, novel outputs may not carry the personal traits of the game character, and it is hard to tell how similar they are to the original dataset.
What are the recommended metrics for this kind of task?
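
To make the concern concrete, here is a minimal sketch of the problem. It uses a simplified clipped n-gram precision as a stand-in for BLEU/ROUGE (not the real metrics), and the NPC lines and function names are made up for illustration. A verbatim copy of a training line gets a perfect overlap score, while a novel line scores low even though nothing here says whether it is in character.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, references, n=2):
    """Clipped n-gram precision of a candidate against reference strings.

    Simplified, BLEU-like stand-in: whitespace tokenization, one n-gram
    order, no brevity penalty.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not cand_counts:
        return 0.0
    # For each n-gram, keep the maximum count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.lower().split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return matched / sum(cand_counts.values())


def novel_ngram_ratio(candidate, references, n=2):
    """Fraction of candidate n-grams that appear in no reference line."""
    cand = ngrams(candidate.lower().split(), n)
    if not cand:
        return 0.0
    seen = {g for ref in references for g in ngrams(ref.lower().split(), n)}
    return sum(g not in seen for g in cand) / len(cand)


# Hypothetical training lines and model outputs, for illustration only.
train_lines = [
    "Stay a while and listen, traveler.",
    "The road north is not safe, traveler.",
]
copied_output = train_lines[0]
novel_output = "Listen, the road south is safe today."

for name, text in [("copied", copied_output), ("novel", novel_output)]:
    print(name,
          "overlap:", round(ngram_precision(text, train_lines), 2),
          "novelty:", round(novel_ngram_ratio(text, train_lines), 2))
```

The overlap score alone cannot separate "memorized" from "good", and the novelty ratio alone cannot separate "creative" from "out of character", which is why I am looking for something better than either.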