Replicating SQuAD results on T5

Hi, I’m trying to replicate the SQuAD experiment in the T5 paper. I’m following the paper’s recommended hyperparameters for finetuning:

  • AdaFactor optimizer
  • Batch size 128 (I’m doing 16 per GPU on 8xRTX 3090 GPUs)
  • 2^18 (262,144) fine-tuning steps (a few hundred epochs at this batch size)
  • Max sequence length 512
  • Learning rate 0.001
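For what it's worth, the paper's 2^18-step schedule at batch size 128 is a lot of passes over the data. A quick sanity check, assuming SQuAD v1.1's 87,599 training examples:

```python
# Convert the paper's fine-tuning schedule into epochs over SQuAD v1.1.
steps = 2 ** 18           # 262,144 optimizer steps
batch_size = 128          # sequences per step
train_examples = 87_599   # SQuAD v1.1 training set size

epochs = steps * batch_size / train_examples
print(f"{epochs:.0f}")    # -> 383
```

So running the full schedule would mean roughly 383 epochs, though in practice (as below) the model overfits long before that.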

I’m running the following (one flag per line for readability):

  --model_name_or_path t5-base \
  --dataset_name squad \
  --context_column context \
  --question_column question \
  --answer_column answers \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 16 \
  --optim adafactor \
  --learning_rate 0.001 \
  --num_train_epochs 300 \
  --evaluation_strategy epoch \
  --max_seq_length 512 \
  --predict_with_generate \
  --output_dir /tmp/t5_squad/ \
  --overwrite_output_dir

After 4 epochs, the validation Exact Match score is 79.054 and F1 is 86.895; beyond that point the model starts to overfit and performance decreases. However, the paper reports 85.44 EM and 92.08 F1 for T5-Base (Table 14).
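In case it helps with comparing numbers: the EM/F1 scores come from the SQuAD v1.1 evaluation script's answer normalization. Here's a minimal, self-contained sketch of that scoring logic for sanity-checking single predictions (note the official script additionally takes the max over all gold answers per question):

```python
import collections
import re
import string

# SQuAD v1.1-style normalization: lowercase, strip punctuation,
# drop the articles a/an/the, and collapse whitespace.
def normalize_answer(s: str) -> str:
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, ground_truth: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))

def f1_score(prediction: str, ground_truth: str) -> float:
    # Token-level F1 between the normalized prediction and gold answer.
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower."))  # -> 1.0
print(round(f1_score("Paris France", "Paris"), 3))       # -> 0.667
```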

Has anyone been able to reproduce the official paper results, or am I missing something?

Hi, I’m also having trouble replicating the SQuAD results.

I used the t5-base and google/t5-v1_1-base checkpoints for fine-tuning on SQuAD. Using Adafactor with lr 0.001 and gradient accumulation, I got 79.4 EM and 87.81 F1, which is close to your results.

I also tried other hyperparameter settings and got my best result of 84.68 EM and 91.56 F1 (Adafactor, lr 8e-5, batch size 16, no gradient accumulation), which is still slightly below the reported 85.44 EM and 92.08 F1.

I’m still trying to find what went wrong. :thinking:

Setting scale_parameter=True in Adafactor results in 84.2 EM and 91.1 F1.
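For reference, here's one way to set that flag explicitly. I believe Trainer's --optim adafactor uses scale_parameter=False, relative_step=False by default, so passing a custom optimizer lets you control it. This is just a sketch: it assumes model, training_args, train_dataset, and eval_dataset are already set up as in the run above.

```python
# Sketch (untested fragment): hand Seq2SeqTrainer a custom Adafactor
# so that scale_parameter can be set explicitly.
from transformers import Adafactor, Seq2SeqTrainer

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,               # fixed learning rate, as in the paper's recipe
    scale_parameter=True,  # the setting reported above (84.2 EM / 91.1 F1)
    relative_step=False,   # must be False when an explicit lr is given
    warmup_init=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    optimizers=(optimizer, None),  # None -> Trainer builds its default lr scheduler
)
trainer.train()
```

Note that with a scheduler of None, Trainer still applies its default learning-rate schedule on top of the fixed lr, so you may want to pass a constant schedule instead if you're matching the paper exactly.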