Hi,
I am working on fine-tuning different transformer models for question answering in the aerospace engineering domain.
I have selected BERT as my baseline model, which I am fine-tuning on SQuAD 2.0 using run_qa.py. For testing, I have created my own test dataset consisting of questions about scientific papers from the NASA Technical Reports Server.
As the baseline, I use the parameters below:
python3 run_qa.py \
--model_name_or_path bert-base-uncased \
--dataset_name squad_v2 \
--do_train \
--do_eval \
--version_2_with_negative \
--num_train_epochs 3 \
--learning_rate 3e-5 \
--max_seq_length 384 \
--doc_stride 128 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 1 \
--per_device_eval_batch_size 16 \
--logging_steps 50 \
--evaluation_strategy epoch \
--save_strategy epoch \
--output_dir trials/bert_base_lr_3e_5
If I only change max_seq_length from 384 to 448, the validation F1 and EM scores increase on the 2nd epoch. However, on my test set the scores drop considerably, per the table below. Is there a logical explanation for why accuracy drops on my test set despite the increase in validation scores? As far as I understand, increasing the maximum sequence length could improve prediction accuracy because the model can take longer dependencies into account (see the chunking sketch after the table).
| max_seq_length | Epoch | Test F1 | Test EM | Eval F1 | Eval EM |
|---|---|---|---|---|---|
| 384 | 3 | 63.16 | 57 | 76.72 | 73.33 |
| 448 | 2 | 57.65 | 54 | 77.32 | 74.29 |
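For context on why I expected max_seq_length to matter: as far as I understand, run_qa.py splits each long context into overlapping windows of max_seq_length tokens with a doc_stride overlap, so changing max_seq_length also changes how many features each example produces. Here is a minimal sketch of that chunking (the question and context strings are just placeholders, not from my dataset):

```python
from transformers import AutoTokenizer

# Same checkpoint as in the command above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "What is the primary mirror diameter?"  # placeholder question
context = "The telescope's primary mirror measures 6.5 meters. " * 200  # placeholder long context

for max_len in (384, 448):
    features = tokenizer(
        question,
        context,
        max_length=max_len,
        stride=128,                      # same as --doc_stride
        truncation="only_second",        # truncate only the context, never the question
        return_overflowing_tokens=True,  # emit one feature per sliding window
        padding="max_length",
    )
    # Each overflowing window becomes a separate train/eval feature.
    print(f"max_seq_length={max_len}: {len(features['input_ids'])} features")
```

With 448, the same context is split into fewer, longer windows, so the model sees different chunk boundaries during both training and evaluation than it does with 384.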