T5 model for summarization far from SOTA results


I am having a hard time achieving state-of-the-art results after fine-tuning T5-base for text summarization.

I am using the full Transformers implementation (that is, the Seq2SeqTrainer class), and I am fine-tuning the model on the XSum dataset.

I have preprocessed the dataset according to the documentation:

  • "summarize: " prefix, and the "</s>" token at the end of the label texts
  • max_input_length == 512

I am also only using the attention_mask provided by the tokenization of the input sequence.
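For reference, here is a minimal sketch of the preprocessing described above. The column names "document" and "summary", the max_target_length value of 64, and the function name preprocess_batch are assumptions for illustration, not values taken from my actual script. As far as I know, the T5 tokenizer appends the end-of-sequence token itself, so the sketch does not add it by hand.

```python
def preprocess_batch(batch, tokenizer, prefix="summarize: ",
                     max_input_length=512, max_target_length=64):
    """Tokenize a batch of articles and summaries for T5 fine-tuning.

    Assumptions: `batch` has "document" and "summary" columns, and
    max_target_length=64 is an illustrative value only.
    """
    # Prepend the task prefix to every input article.
    inputs = [prefix + doc for doc in batch["document"]]
    # Inputs are truncated to max_input_length; the attention_mask
    # comes straight from the tokenizer output.
    model_inputs = tokenizer(inputs, max_length=max_input_length,
                             truncation=True)
    # Tokenize the target summaries and store their ids as labels.
    labels = tokenizer(text_target=batch["summary"],
                       max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Usage (not run here; needs the transformers and datasets packages,
# and `raw_datasets` is a hypothetical name for the loaded XSum splits):
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("t5-base")
#   tokenized = raw_datasets.map(
#       lambda b: preprocess_batch(b, tokenizer), batched=True)
```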

As to my training arguments:

  • I have tried using both AdamW and Adafactor, with a learning rate of 3e-4 and weight_decay of 5e-5.
  • my batch_size is currently 4, with gradient_accumulation_steps = 64, eval_accumulation_steps = 64
  • I am using predict_with_generate = True, for obvious reasons
  • From what I’ve read, FP16 causes performance problems with this model, so I set fp16=False.
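Put together, the arguments above would look roughly like the sketch below. The output_dir value and the helper name build_training_args are hypothetical, and the optim="adafactor" setting assumes a Transformers version that supports the optim argument; it shows only one of the two optimizers I tried.

```python
# Keyword arguments mirroring the settings listed above.
TRAINING_KWARGS = dict(
    output_dir="t5-base-xsum",        # hypothetical output path
    learning_rate=3e-4,
    weight_decay=5e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=64,
    eval_accumulation_steps=64,
    predict_with_generate=True,       # evaluate with generate() for ROUGE
    fp16=False,                       # FP16 reportedly problematic for T5
    optim="adafactor",                # assumption: version supports `optim`
)


def build_training_args(**overrides):
    """Build Seq2SeqTrainingArguments from the settings above.

    The import is done lazily so the dictionary can be inspected
    without transformers installed.
    """
    from transformers import Seq2SeqTrainingArguments
    return Seq2SeqTrainingArguments(**{**TRAINING_KWARGS, **overrides})
```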

Even so, I am only scoring approximately 28 ROUGE-1, whereas the model reached ~43 ROUGE-1 in the paper. I get this score even after purposely training on only 32 examples for 100 epochs to force overfitting.
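Working through the numbers for the overfitting run, assuming a single device (the device count is an assumption here), the 32-example subset is much smaller than one accumulated batch. The number of optimizer updates may therefore be far lower than the epoch count suggests, depending on how the Trainer handles epochs shorter than the accumulation window.

```python
import math

# Settings from the overfitting experiment described above.
num_examples = 32
per_device_batch_size = 4
gradient_accumulation_steps = 64
num_epochs = 100

# Mini-batches the dataloader yields in one pass over the subset.
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# Examples contributing to one optimizer update on a single device.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Forward/backward passes over the whole run.
total_batches = batches_per_epoch * num_epochs

print(f"batches per epoch:    {batches_per_epoch}")
print(f"effective batch size: {effective_batch_size}")
print(f"total mini-batches:   {total_batches}")
```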

The summarization example notebook by Hugging Face also reaches only about 28 ROUGE-1.
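To make clear what is being compared, here is a simplified sketch of ROUGE-1 F1 as unigram overlap between a prediction and a reference. It is not the official scorer, which applies its own tokenization and optional stemming, and the function name rouge1_f1 is my own. Scores such as 28 and 43 correspond to this value multiplied by 100.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each token count to the smaller side.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Example: score one generated summary against its reference.
score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(f"ROUGE-1 F1 (x100): {100 * score:.1f}")
```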

Are there any tips for fine-tuning? Should I implement my own trainer with PyTorch Lightning?
I have already checked the discussion at T5 Finetuning Tips, but the results I’m getting aren’t improving at all.

Thanks a lot!