Why does increasing sequence length reduce Q&A performance on my test set?

Hi,

I am working on fine-tuning different transformer models for question answering in the aerospace engineering domain.

I have selected BERT as my baseline model, which I am fine-tuning on SQuAD 2.0 using run_qa.py. For testing, I have created my own test dataset that comprises questions about scientific papers from the NASA Technical Reports Server.

As the baseline, I use the parameters below:

python3 run_qa.py \
    --model_name_or_path bert-base-uncased \
    --dataset_name squad_v2 \
    --do_train \
    --do_eval \
    --version_2_with_negative \
    --num_train_epochs 3 \
    --learning_rate 3e-5 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --per_device_eval_batch_size 16 \
    --logging_steps 50 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --output_dir trials/bert_base_lr_3e_5
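
For the test numbers, predictions on the NASA test set can be produced with the same script, roughly along these lines; nasa_trs_test.json is a placeholder name for my SQuAD-format test file, and the checkpoint path points at the training output above:

# nasa_trs_test.json is a placeholder for a SQuAD-format (data/paragraphs/qas) test file
python3 run_qa.py \
    --model_name_or_path trials/bert_base_lr_3e_5 \
    --test_file nasa_trs_test.json \
    --do_predict \
    --version_2_with_negative \
    --max_seq_length 384 \
    --doc_stride 128 \
    --per_device_eval_batch_size 16 \
    --output_dir trials/bert_base_lr_3e_5/test_predictions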

If I only change max_seq_length from 384 to 448, the validation F1 and EM scores increase by the 2nd epoch. However, on my test set the scores drop considerably, as shown in the table below. Is there a logical explanation for why accuracy drops on my test set despite the increase in validation scores? As far as I understand, increasing the maximum sequence length could improve prediction accuracy because the model can take longer dependencies into account.

max_seq_length  Epoch  Test F1  Test EM  Eval F1  Eval EM
384             3      63.16    57       76.72    73.33
448             2      57.65    54       77.32    74.29
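
To make the effect of max_seq_length more concrete, here is a small sketch of the sliding-window preprocessing that run_qa.py applies, where the question and context strings are only placeholder text, not from my actual test set:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "What shields the vehicle during re-entry?"
# Placeholder long context standing in for a paragraph from a NASA report.
context = " ".join(
    ["The thermal protection system shields the vehicle during atmospheric re-entry."] * 150
)

for max_len in (384, 448):
    encoded = tokenizer(
        question,
        context,
        max_length=max_len,
        stride=128,                      # same role as --doc_stride
        truncation="only_second",        # only the context gets split into windows
        return_overflowing_tokens=True,  # one feature per window
    )
    print(max_len, "->", len(encoded["input_ids"]), "overlapping windows")

With max_seq_length=448 each long context is split into fewer but longer windows than with 384, so the features the model is trained and evaluated on are not identical between the two runs.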