Why does increasing sequence length reduce Q&A performance on my test set?


I am fine-tuning several transformer models for question answering in the aerospace engineering domain.

I have selected BERT as my baseline model, which I am fine-tuning on SQuAD 2.0 using run_qa.py. For testing, I have created my own test dataset, which consists of questions about scientific papers in the NASA Technical Reports Server.

As the baseline, I use the parameters below:

python3 run_qa.py \
    --model_name_or_path bert-base-uncased \
    --dataset_name squad_v2 \
    --do_train \
    --do_eval \
    --version_2_with_negative \
    --num_train_epochs 3 \
    --learning_rate 3e-5 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --per_device_eval_batch_size 16 \
    --logging_steps 50 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --output_dir trials/bert_base_lr_3e_5
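
To make the two windowing flags concrete: as far as I understand, a context that does not fit into one input is split into overlapping windows, where max_seq_length caps each window (question, context chunk, and special tokens together) and doc_stride sets the overlap between consecutive chunks. The sketch below is my own rough approximation of that arithmetic, not code from run_qa.py. The function name count_windows and the token counts in the usage lines are hypothetical, and I assume 3 special tokens, as for a BERT question/context pair.

```python
import math


def count_windows(context_tokens, question_tokens, max_seq_length,
                  doc_stride, special_tokens=3):
    """Approximate number of overlapping windows for one question/context pair.

    Assumption: each window holds the question, the special tokens, and as
    many context tokens as fit; consecutive windows share doc_stride tokens.
    """
    # Context tokens that fit in a single window.
    room = max_seq_length - question_tokens - special_tokens
    if room <= doc_stride:
        raise ValueError("doc_stride must be smaller than the context room")
    if context_tokens <= room:
        return 1
    # Each additional window advances by (room - doc_stride) new tokens.
    step = room - doc_stride
    return 1 + math.ceil((context_tokens - room) / step)


# Hypothetical example: a 1000-token context and a 20-token question.
windows_384 = count_windows(1000, 20, max_seq_length=384, doc_stride=128)
windows_448 = count_windows(1000, 20, max_seq_length=448, doc_stride=128)
print(windows_384, windows_448)
```

Under these assumptions, a longer max_seq_length means fewer but larger windows per context, so the set of training and evaluation features changes along with the window size.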

If I change only max_seq_length from 384 to 448, the validation F1 and EM scores (at the 2nd epoch) increase. However, on my own test set, both F1 and EM drop considerably, as shown in the table below. Is there a logical explanation for why performance drops on my test set despite the increase in validation scores? As far as I understand, increasing the maximum sequence length should improve prediction accuracy, because the model can take longer-range dependencies into account.

max_seq_length  Epoch  Test F1  Test EM  Eval F1  Eval EM
384             3      63.16    57       76.72    73.33
448             2      57.65    54       77.32    74.29
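
For clarity on what F1 and EM mean here, this is a minimal sketch of the per-question SQuAD-style computation, assuming the usual answer normalization (lowercasing, stripping punctuation and the articles a/an/the). The helper names are my own for illustration and are not taken from run_qa.py; the example strings are made up.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # For unanswerable questions (SQuAD 2.0), both sides must be empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the prediction contains one extra token.
print(exact_match("turbulent boundary layer", "boundary layer"))
print(f1_score("turbulent boundary layer", "boundary layer"))
```

The reported scores are these per-question values averaged over the dataset and multiplied by 100.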