I am working on fine-tuning different transformer models for question answering in the aerospace engineering domain.
I have selected BERT as my baseline model, which I am fine-tuning on SQuAD 2.0 using run_qa.py. For testing, I have created my own test dataset, which consists of questions about scientific papers from the NASA Technical Reports Server.
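For reference, the test file follows the SQuAD-style JSON layout that run_qa.py loads from custom files (entries nested under a top-level "data" key). Below is a minimal sketch of how I build one record; the file name nasa_test.json and all field values are placeholders:

```python
# Minimal sketch of one SQuAD-style record for a custom run_qa.py test file.
# The nesting under "data" matters because the script loads JSON files with
# field="data". All values below are placeholders.
import json

record = {
    "id": "ntrs-0001",                                # unique example id
    "title": "Sample NASA Technical Report",          # placeholder title
    "context": "The X-38 was a prototype crew return vehicle ...",
    "question": "What was the X-38?",
    "answers": {                                      # SQuAD 2.0 answer format;
        "text": ["a prototype crew return vehicle"],  # empty lists mark an
        "answer_start": [13],                         # unanswerable question
    },
}

with open("nasa_test.json", "w") as f:
    json.dump({"data": [record]}, f)
```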
For the baseline, I use the parameters below:
```bash
python3 run_qa.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name squad_v2 \
  --do_train \
  --do_eval \
  --version_2_with_negative \
  --num_train_epochs 3 \
  --learning_rate 3e-5 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --per_device_train_batch_size 16 \
  --gradient_accumulation_steps 1 \
  --per_device_eval_batch_size 16 \
  --logging_steps 50 \
  --evaluation_strategy epoch \
  --save_strategy epoch \
  --output_dir trials/bert_base_lr_3e_5
```
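For context, my understanding of what --max_seq_length and --doc_stride control is the sliding-window tokenization sketched below (the question and context strings are placeholders):

```python
# Minimal sketch of the sliding-window tokenization run_qa.py applies to
# contexts longer than max_seq_length; placeholders stand in for real data.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "What was the X-38?"        # placeholder question
context = " ".join(["word"] * 2000)    # placeholder long context

# Each feature holds at most max_length tokens; consecutive windows overlap
# by `stride` tokens, so raising max_length yields fewer, longer windows.
features = tokenizer(
    question,
    context,
    truncation="only_second",        # truncate only the context, never the question
    max_length=384,                  # --max_seq_length
    stride=128,                      # --doc_stride
    return_overflowing_tokens=True,  # emit one feature per window
)
print(f"context split into {len(features['input_ids'])} features")
```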
If I only change max_seq_length from 384 to 448, then on the 2nd epoch the validation F1 and EM scores increase. However, on my test set, accuracy drops considerably, as shown in the table below. Is there a logical explanation for why accuracy drops on my test set despite the increase in validation scores? As far as I understand, increasing the maximum sequence length should be able to improve prediction accuracy, because the model can take longer dependencies into account.
| max_seq_length | Epoch | Test F1 | Test EM | Eval F1 | Eval EM |
|---|---|---|---|---|---|
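As a sanity check, I put together the rough sketch below (assuming the nasa_test.json layout above) to compare context-length distributions between the SQuAD 2.0 validation split and my test set, since a large share of test contexts longer than max_seq_length would mean the windowing behaviour, and therefore the 384 to 448 change, matters much more at test time than on SQuAD validation:

```python
# Rough diagnostic sketch: compare wordpiece context lengths between the
# SQuAD 2.0 validation split and my NASA test set. Assumes the test set is
# the SQuAD-style nasa_test.json sketched above.
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

squad_val = load_dataset("squad_v2", split="validation")
nasa_test = load_dataset("json", data_files="nasa_test.json",
                         field="data", split="train")

def context_lengths(dataset):
    # wordpiece token count per context passage
    return np.array([len(tokenizer(c)["input_ids"]) for c in dataset["context"]])

for name, ds in [("squad_v2 validation", squad_val), ("NASA test", nasa_test)]:
    lengths = context_lengths(ds)
    print(f"{name}: median={np.median(lengths):.0f}, "
          f"p95={np.percentile(lengths, 95):.0f}, "
          f"over 384 tokens: {(lengths > 384).mean():.1%}")
```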