What is the correct way to evaluate SQuAD v1.1?

Hi, what is the correct way to evaluate SQuAD v1.1? I am running experiments on a quantized RoBERTa (I simulate quantization by modifying the forward pass in modeling_roberta.py) and have generated predictions in two different ways. The first way uses the run_qa.py example script; the second loads the dev examples with SquadV1Processor and iterates over them with QuestionAnsweringPipeline. I made sure the hyperparameters are the same in both cases. The two methods generate different predictions, and with the pipeline my EM / F1 scores are roughly 20 points lower than with the example script. Is one method more correct than the other? Thank you so much!
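For context, my understanding is that both paths should ultimately be scored with the official SQuAD v1.1 metric, which normalizes answers before comparing them. Here is a minimal sketch of that scoring logic (following the normalization steps from the official evaluate-v1.1.py script: lowercase, strip punctuation, drop articles, collapse whitespace), in case the discrepancy comes from how the predictions are post-processed rather than from the metric itself:

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Normalize per the official SQuAD v1.1 script:
    lowercase, remove punctuation, remove articles, fix whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))

def f1_score(prediction, ground_truth):
    """Token-level F1 over the normalized answer strings."""
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(f1_score("in the city of Paris", "Paris"))        # 0.4
```

(In the real metric, each prediction is scored against all reference answers and the maximum is taken; I apply this identically to both sets of predictions, so I don't think the scoring itself explains the gap.)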