Evaluation metric for LLM output generation

I have fine-tuned a model to reply in a specific tone (for example, a pirate tone, or the tone of a specific brand's marketing material). How do I check or evaluate whether my model is following the intended tone? Is there an evaluation metric (like ROUGE or perplexity) that can be calculated for this type of use case? Please help.