How to improve the F1 score on the SQuAD2 question answering task with a pretrained DistilBERT model

When I run my inference code in Colab, I get the results below.

{'exact': 31.272635391223783,
 'f1': 35.63616173418905,
 'total': 11873,
 'HasAns_exact': 59.83468286099865,
 'HasAns_f1': 68.57424903340527,
 'HasAns_total': 5928,
 'NoAns_exact': 2.7922624053826746,
 'NoAns_f1': 2.7922624053826746,
 'NoAns_total': 5945,
 'best_exact': 50.07159100480081,
 'best_exact_thresh': 0.0,
 'best_f1': 50.07159100480081,
 'best_f1_thresh': 0.0}

When I use the Hugging Face run_qa.py evaluation script below, I get much better results. What should I change in the Colab code to bring the EM and F1 up to these numbers?

python run_qa.py \
--model_name_or_path /path/to/distilbert-squad2 \
--dataset_name squad_v2 \
--version_2_with_negative \
--do_eval \
--per_device_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./tmp
Eval Results:
02/25/2021 07:13:08 - INFO - __main__ -   ***** Eval results *****
02/25/2021 07:13:08 - INFO - __main__ -     HasAns_exact = 71.54183535762483
02/25/2021 07:13:08 - INFO - __main__ -     HasAns_f1 = 78.03088635740741
02/25/2021 07:13:08 - INFO - __main__ -     HasAns_total = 5928
02/25/2021 07:13:08 - INFO - __main__ -     NoAns_exact = 72.22876366694702
02/25/2021 07:13:08 - INFO - __main__ -     NoAns_f1 = 72.22876366694702
02/25/2021 07:13:08 - INFO - __main__ -     NoAns_total = 5945
02/25/2021 07:13:08 - INFO - __main__ -     best_exact = 71.88579129116482
02/25/2021 07:13:08 - INFO - __main__ -     best_exact_thresh = 0.0
02/25/2021 07:13:08 - INFO - __main__ -     best_f1 = 75.12567121424334
02/25/2021 07:13:08 - INFO - __main__ -     best_f1_thresh = 0.0
02/25/2021 07:13:08 - INFO - __main__ -     exact = 71.88579129116482
02/25/2021 07:13:08 - INFO - __main__ -     f1 = 75.12567121424338
02/25/2021 07:13:08 - INFO - __main__ -     total = 11873

Also, how can I generalize this Colab evaluation code to other transformer-based question answering models?

Hi @bhadresh-savani, there’s a lot of tricky pre- and post-processing needed to get question answering working correctly. For example, I think your implementation is missing the sliding window needed to chunk long documents into passages, as well as the sorting of the predicted answers during evaluation (see the sketches below).
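
In case it helps, here is a minimal sketch of what that chunking step looks like with a fast tokenizer (the checkpoint name, question and context are placeholders; 384 and 128 just mirror the max_seq_length and doc_stride you pass to run_qa.py):

from transformers import AutoTokenizer

# Any fast tokenizer works here; the checkpoint is only a placeholder.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

question = "When was the tower built?"
context = "..."  # a long passage that may not fit into max_seq_length

# Long contexts are split into overlapping chunks ("features") so the answer
# is still covered even when the passage is longer than max_seq_length.
encoded = tokenizer(
    question,
    context,
    max_length=384,               # same as --max_seq_length
    stride=128,                   # same as --doc_stride: overlap between chunks
    truncation="only_second",     # only truncate the context, never the question
    return_overflowing_tokens=True,
    return_offsets_mapping=True,  # needed later to map tokens back to characters
    padding="max_length",
)

# encoded["input_ids"] now holds one entry per chunk of the same example, and
# encoded["overflow_to_sample_mapping"] tells you which example each chunk came from.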

Sylvain Gugger has a nice Colab tutorial with all these details here, so my suggestion would be to compare his implementation against yours to see what you need to add.
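
For the sorting part, the post-processing in that notebook essentially boils down to the sketch below: for each chunk you take the top-n start and end logits, score every valid (start, end) pair, and keep the best span. All names here are illustrative rather than taken from your code:

import numpy as np

def best_answer_span(start_logits, end_logits, offset_mapping, context,
                     n_best=20, max_answer_length=30):
    # Sort candidate start/end positions by logit score and keep the top n_best of each.
    start_indexes = np.argsort(start_logits)[-n_best:][::-1]
    end_indexes = np.argsort(end_logits)[-n_best:][::-1]

    best = {"score": -float("inf"), "text": ""}
    for s in start_indexes:
        for e in end_indexes:
            # Skip tokens that are not part of the context (their offset mapping is None)
            # and spans that are malformed or too long.
            if offset_mapping[s] is None or offset_mapping[e] is None:
                continue
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            if score > best["score"]:
                start_char, end_char = offset_mapping[s][0], offset_mapping[e][1]
                best = {"score": score, "text": context[start_char:end_char]}
    return best

For SQuAD v2 you then compare the best span's score against the null (CLS) score to decide whether to output "no answer" at all, which is what drives the NoAns_exact / NoAns_f1 numbers you're seeing.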
