When I run my inference code in Colab, I get the results below.
{
  'exact': 31.272635391223783,
  'f1': 35.63616173418905,
  'total': 11873,
  'HasAns_exact': 59.83468286099865,
  'HasAns_f1': 68.57424903340527,
  'HasAns_total': 5928,
  'NoAns_exact': 2.7922624053826746,
  'NoAns_f1': 2.7922624053826746,
  'NoAns_total': 5945,
  'best_exact': 50.07159100480081,
  'best_exact_thresh': 0.0,
  'best_f1': 50.07159100480081,
  'best_f1_thresh': 0.0
}
When we evaluate with the Hugging Face run_qa.py script below, we get much better results. What should I change in the Colab code to bring its EM and F1 up to match?
python run_qa.py \
--model_name_or_path /path/to/distilbert-squad2 \
--dataset_name squad_v2 \
--version_2_with_negative \
--do_eval \
--per_device_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./tmp
Eval Results:
02/25/2021 07:13:08 - INFO - __main__ - ***** Eval results *****
02/25/2021 07:13:08 - INFO - __main__ - HasAns_exact = 71.54183535762483
02/25/2021 07:13:08 - INFO - __main__ - HasAns_f1 = 78.03088635740741
02/25/2021 07:13:08 - INFO - __main__ - HasAns_total = 5928
02/25/2021 07:13:08 - INFO - __main__ - NoAns_exact = 72.22876366694702
02/25/2021 07:13:08 - INFO - __main__ - NoAns_f1 = 72.22876366694702
02/25/2021 07:13:08 - INFO - __main__ - NoAns_total = 5945
02/25/2021 07:13:08 - INFO - __main__ - best_exact = 71.88579129116482
02/25/2021 07:13:08 - INFO - __main__ - best_exact_thresh = 0.0
02/25/2021 07:13:08 - INFO - __main__ - best_f1 = 75.12567121424334
02/25/2021 07:13:08 - INFO - __main__ - best_f1_thresh = 0.0
02/25/2021 07:13:08 - INFO - __main__ - exact = 71.88579129116482
02/25/2021 07:13:08 - INFO - __main__ - f1 = 75.12567121424338
02/25/2021 07:13:08 - INFO - __main__ - total = 11873
Also, how can I generalize this Colab evaluation code so it works with other transformer-based question-answering models?
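For context, one thing I am double-checking on my side (an assumption, since my Colab code is not shown here): the SQuAD v2 metric normalizes answers (lowercasing, stripping punctuation and articles) before computing EM and token-level F1, so skipping that normalization would depress the scores. Below is a minimal sketch of that scoring logic, based on my reading of the official SQuAD v2 evaluation script, which a generalized evaluator could reuse for any model's predicted answer strings:

```python
import collections
import re
import string

def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation,
    remove articles (a/an/the), and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def compute_exact(gold: str, pred: str) -> int:
    """Exact match after normalization (1 or 0)."""
    return int(normalize_answer(gold) == normalize_answer(pred))

def compute_f1(gold: str, pred: str) -> float:
    """Token-level F1 between normalized gold and predicted answers.
    For no-answer examples (empty strings), F1 is 1 only if both are empty."""
    gold_toks = normalize_answer(gold).split()
    pred_toks = normalize_answer(pred).split()
    if not gold_toks or not pred_toks:
        return float(gold_toks == pred_toks)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

Since this scoring is model-agnostic (it only looks at answer strings), any QA model whose predictions are reduced to text spans, plus an empty string for no-answer, can be evaluated the same way.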