Hi, I’m trying to replicate the SQuAD experiment in the T5 paper. I’m following the paper’s recommended hyperparameters for fine-tuning:
- AdaFactor optimizer (see the optimizer sketch after this list)
- Batch size 128 (16 per GPU across 8× RTX 3090s, so 16 × 8 = 128 effective)
- 2^18 steps for fine-tuning (roughly 380 epochs at batch size 128, not the ~300 I first estimated; see the arithmetic after this list)
- Max sequence length 512
- Learning rate 0.001
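Here’s the optimizer setup as I understand it. A minimal sketch using HF’s `Adafactor` with the fixed-learning-rate settings; if I read the Trainer source right, this is roughly what `--optim adafactor` builds internally:

```python
from transformers import T5ForConditionalGeneration
from transformers.optimization import Adafactor

model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Fixed LR 1e-3 as in the paper: relative-step scaling and warmup must be
# disabled for the explicit lr to take effect.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
```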
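And the step-to-epoch arithmetic, assuming SQuAD v1.1’s 87,599 training examples:

```python
# How many epochs the paper's 2^18 fine-tuning steps correspond to.
train_examples = 87_599        # SQuAD v1.1 train split size
batch_size = 16 * 8            # per-device batch x number of GPUs = 128
steps = 2 ** 18                # 262,144 steps from the paper
print(steps * batch_size / train_examples)  # ~383 epochs
```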
I’m running the following:
```bash
python run_seq2seq_qa.py \
  --model_name_or_path t5-base \
  --dataset_name squad \
  --context_column context \
  --question_column question \
  --answer_column answers \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 16 \
  --optim adafactor \
  --learning_rate 0.001 \
  --num_train_epochs 300 \
  --evaluation_strategy epoch \
  --max_seq_length 512 \
  --predict_with_generate \
  --output_dir /tmp/t5_squad/ \
  --overwrite_output_dir
```
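One thing I’m not sure about: this leaves the Trainer’s default learning-rate schedule in place (linear decay, I believe), while the paper fine-tunes at a constant LR. Here’s a sketch of what I think the equivalent `Seq2SeqTrainingArguments` with a constant schedule and the paper’s exact step count would look like (argument names assume a recent `transformers`):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="/tmp/t5_squad/",
    per_device_train_batch_size=16,
    optim="adafactor",
    learning_rate=1e-3,
    lr_scheduler_type="constant",  # paper uses a fixed LR; the default is linear decay
    max_steps=2**18,               # the paper's 2^18 fine-tuning steps
    evaluation_strategy="epoch",
    predict_with_generate=True,
)
```

(On the command line that should be `--lr_scheduler_type constant --max_steps 262144` in place of `--num_train_epochs 300`, if I’m reading the docs right.)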
After 4 epochs, the validation Exact Match score is 79.054 and F1 is 86.895; beyond that point the model starts to overfit and performance decreases. The paper, however, reports 85.44 EM and 92.08 F1 for T5-Base (Table 14).
Has anyone been able to reproduce the official paper results, or am I missing something?