The tutorial for prompt-based tuning (Prompt-based methods) uses the `twitter_complaints` subset of `ought/raft`. But from what I can see, all of the entries in the test split are labeled `Unlabeled`.
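For reference, this is roughly how I checked (a minimal sketch; it assumes the label column is named `Label` in the test split, as it is in the train split):

```python
from collections import Counter

from datasets import load_dataset

raft = load_dataset("ought/raft", "twitter_complaints")

# Map integer class ids back to their string names for readability.
label_feature = raft["test"].features["Label"]
labels = [label_feature.int2str(i) for i in raft["test"]["Label"]]

# Count how often each label occurs in the test split --
# every entry appears to come back as "Unlabeled".
print(Counter(labels))
```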
How are the evaluation metrics supposed to work in that case? If the trained model correctly predicts the label, it will still be compared against `Unlabeled` ground truth, so the metrics become meaningless.
Or is there something happening with the evaluation that I don’t quite get?