Unlabeled entries in evaluation subset?

The Prompt-based methods tutorial for prompt-based tuning uses the twitter_complaints subset of ought/raft. But from what I can see, all of the entries in the test split are Unlabeled.
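For reference, this is roughly how I checked (a minimal sketch; I'm assuming the `Label` column is a `ClassLabel`, as it appears to be in the RAFT configs):

```python
from collections import Counter
from datasets import load_dataset

# Load the same dataset/config as in the tutorial
dataset = load_dataset("ought/raft", "twitter_complaints")

# Map the integer class ids in the test split back to their string names
label_names = dataset["test"].features["Label"].names
counts = Counter(label_names[i] for i in dataset["test"]["Label"])
print(counts)  # every test example seems to be "Unlabeled"
```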

How are the evaluation metrics supposed to work in that case? Even if the trained model predicts the correct label, it will be compared against an "Unlabeled" ground truth, so the metrics become meaningless.

Or is there something happening with the evaluation that I don’t quite get?