How to evaluate T5 on a classification task when it is trained on multiple tasks

I combined multiple datasets (e.g. translation + classification) into one big dataset and trained T5 on it, with max_target_length = 128 for instance. The classification task I have is boolq, where the labels are True = 2 tokens and False = 4 tokens. When I want to evaluate the trained model on boolq, how should I compute the metric? Can I reset max_length = 4 during evaluation? How should I measure accuracy in this case? The model sometimes generates longer sequences as output instead of exactly True/False. Do the outputs need some processing before doing the evaluation? Thanks.
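
To make the question concrete, this is roughly the kind of processing I have in mind. It is only a minimal sketch in plain Python, assuming the generated outputs have already been decoded to strings and the gold labels are the strings "True"/"False". The helper names `normalize_prediction` and `boolq_accuracy` are made up for illustration and are not part of any library.

```python
import re


def normalize_prediction(text):
    """Map a decoded generation to 'True', 'False', or None if neither appears."""
    # Take the first occurrence of either label, ignoring case and any extra text.
    match = re.search(r"\b(true|false)\b", text.lower())
    if match is None:
        return None
    return match.group(1).capitalize()


def boolq_accuracy(predictions, references):
    """Exact-match accuracy after normalization.

    Assumption: outputs that contain neither label count as wrong.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize_prediction(pred) == ref
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical decoded outputs, including a repeated label and an off-task output.
decoded = ["True", "false </s>", "True True True", "Das ist ein Haus"]
gold = ["True", "False", "False", "True"]
print(boolq_accuracy(decoded, gold))  # 0.5
```

Here the first two outputs normalize to the correct label, the third normalizes to the wrong label, and the last one contains no label at all, so it is counted as an error rather than dropped.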