What is the correct way to evaluate SQuAD v1.1?

jamcheung · February 8, 2022, 3:52am

Hi, what is the correct way to evaluate SQuAD v1.1? I am running experiments on a quantized RoBERTa (I simulate the quantization by modifying the forward pass in modeling_roberta.py) and generated predictions in two different ways. The first uses the run_qa.py example script; the second loads the dev examples with SquadV1Processor and iterates over them with QuestionAnsweringPipeline. I made sure the hyperparameters are the same in both cases. The two methods produce different predictions, and with the pipeline my EM / F1 scores are 20% worse than with the example script. Is one method more correct than the other? Thank you so much!
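
For reference, here is a minimal sketch of how EM / F1 could be computed identically for both prediction sets, so that any remaining gap comes from the predictions rather than from the scoring. It is modeled on the answer normalization used by the official SQuAD v1.1 evaluation script (lowercase, strip punctuation, drop articles, collapse whitespace). The function name `squad_v1_scores` and the two dict formats (question id to predicted text, question id to list of gold answers) are assumptions made for illustration, not part of run_qa.py or the pipeline.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def squad_v1_scores(predictions, references):
    """Hypothetical helper: average EM and F1 (as percentages).

    predictions: dict mapping question id -> predicted answer text
    references:  dict mapping question id -> list of gold answer texts
    Each question takes the best score over its gold answers; a question
    with no prediction scores zero.
    """
    em_total, f1_total = 0.0, 0.0
    for qid, golds in references.items():
        pred = predictions.get(qid)
        if pred is None:
            continue
        em_total += max(exact_match(pred, g) for g in golds)
        f1_total += max(f1(pred, g) for g in golds)
    n = len(references)
    return {"exact_match": 100.0 * em_total / n, "f1": 100.0 * f1_total / n}


if __name__ == "__main__":
    preds = {"q1": "The Eiffel Tower", "q2": "in Paris"}
    refs = {"q1": ["Eiffel Tower"], "q2": ["Paris", "Paris, France"]}
    print(squad_v1_scores(preds, refs))
```

Dumping the predictions from both methods into the same id-to-text format and scoring them with one function like this would show whether the difference is in the predicted spans themselves.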
