Hi everyone,
I’m trying to reproduce the TriviaQA leaderboard results reported in the Longformer paper, but I’m struggling to match them.
Steps that I took:
- I downloaded TriviaQA’s original dev set.
- I’m using LongformerForQuestionAnswering for evaluation.
- I normalize the predicted answers and compare them to the gold-label answers to compute ExactMatch.
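
For reference, here is a minimal sketch of what that last step could look like. It assumes SQuAD-style normalization (lowercase, strip punctuation and articles, collapse whitespace) and counts a prediction as correct if it matches any of the gold aliases. The function names are illustrative and are not taken from the Longformer repo, and the official TriviaQA evaluation script may normalize slightly differently.

```python
import re
import string

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def exact_match_score(predictions: dict[str, str],
                      references: dict[str, list[str]]) -> float:
    """Percentage of questions answered exactly; missing predictions count as wrong."""
    if not references:
        return 0.0
    hits = sum(
        exact_match(predictions.get(qid, ""), golds)
        for qid, golds in references.items()
    )
    return 100.0 * hits / len(references)
```

Note that comparing against only a single gold answer, rather than all aliases listed for a question, would lower the score.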
Am I missing something? Should any further processing be done before evaluating with LongformerForQuestionAnswering?
I already looked at the Longformer GitHub repo, and it doesn’t seem like they do any additional preprocessing of the dev data/context.