Evaluating my own model on AGIEval or MMLU benchmarks

I recently trained a GPT-2 model from scratch. I now want to evaluate it on two popular benchmarks that assess a model's general knowledge and reasoning ability: AGIEval and MMLU. How do I do this?
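
For context, here is a minimal sketch of how I understand multiple-choice benchmarks such as MMLU are commonly scored: each question is formatted as a prompt with lettered options, the model's log-likelihood for each answer letter is compared, and accuracy is the fraction of questions where the highest-scoring letter is the correct one. All names here (`Scorer`, `format_prompt`, `predict`, `accuracy`, `toy_score`, `EXAMPLES`) are hypothetical, the prompt template is an assumption rather than any benchmark's official format, and the scorer is a toy stand-in where a real one would query my GPT-2 model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: returns log P(continuation | context) under the model.
# With a real model this would sum the token log-probabilities of `continuation`.
Scorer = Callable[[str, str], float]

# (question, answer choices, index of the correct choice)
Example = Tuple[str, Sequence[str], int]

LETTERS = "ABCD"


def format_prompt(question: str, choices: Sequence[str]) -> str:
    """Build a zero-shot prompt (assumed template, not an official one)."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def predict(score: Scorer, question: str, choices: Sequence[str]) -> int:
    """Return the index of the answer letter the model finds most likely."""
    prompt = format_prompt(question, choices)
    scores = [score(prompt, f" {LETTERS[i]}") for i in range(len(choices))]
    return max(range(len(choices)), key=scores.__getitem__)


def accuracy(score: Scorer, examples: List[Example]) -> float:
    """Fraction of examples where the predicted letter matches the answer."""
    correct = sum(predict(score, q, ch) == ans for q, ch, ans in examples)
    return correct / len(examples)


def toy_score(context: str, continuation: str) -> float:
    """Stand-in scorer that always prefers 'B', so the sketch runs without a model."""
    return 0.0 if continuation.strip() == "B" else -1.0


# Made-up examples for illustration only, not taken from MMLU or AGIEval.
EXAMPLES: List[Example] = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], 1),
    ("Which planet is closest to the Sun?", ["Mercury", "Venus", "Earth", "Mars"], 0),
]

if __name__ == "__main__":
    print(accuracy(toy_score, EXAMPLES))
```

What I am unsure about is how to replace `toy_score` with my own model, how to load the actual benchmark data, and whether there is an existing tool that already handles both.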