What Are the Latest Methods to Evaluate Instruction-Tuned Models on a Custom Test Set?

Hello everyone, I’m wondering if there has been any progress on automatically evaluating instruction-tuned models on custom datasets, i.e. on comparing how well different models follow instructions. I’m currently fine-tuning a model and want to evaluate it on the self-instruct test set.

The only approaches I’m vaguely aware of are having either a human or a stronger model act as the judge, but I wonder if there have been any new developments. References to papers are certainly welcome. Thank you.