The effectiveness of initializing Encoder-Decoder models from pre-trained encoder-only checkpoints, such as BERT and RoBERTa, for sequence-to-sequence tasks has been shown in Leveraging Pre-trained Checkpoints for Sequence Generation Tasks: https://arxiv.org/abs/1907.12461.
Similarly, the EncoderDecoderModel framework of Transformers can be used to initialize Encoder-Decoder models from pre-trained checkpoints such as “bert-base-cased” or “roberta-base” for summarization.
One can initialize such a model with weights from pre-trained checkpoints via:
```python
from transformers import EncoderDecoderModel

bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
```
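Before such a warm-started model is fine-tuned, a few decoder-related token ids typically have to be set on its config so that generation works correctly. A minimal sketch, assuming the common choice of reusing BERT's special tokens for this purpose (the token choices and the save path are illustrative assumptions, not settings prescribed above):

```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

# Reuse BERT's special tokens as start/end/pad tokens for the decoder
# (an assumption following common practice for BERT checkpoints).
bert2bert.config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.config.eos_token_id = tokenizer.sep_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id

# The warm-started model can now be saved and later fine-tuned on a summarization dataset.
bert2bert.save_pretrained("bert2bert-warm-started")
```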
A couple of models based on “bert-base-cased” or “roberta-base” have been fine-tuned this way on the CNN/Daily-Mail summarization task to verify that the EncoderDecoderModel framework is functional.
Below are the ROUGE-2 F-measure results on the CNN/Daily-Mail test set:
- Bert2GPT2: 15.19 - https://huggingface.co/patrickvonplaten/bert2gpt2-cnn_dailymail-fp16
- Bert2Bert: 16.10 - https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16
- Roberta2Roberta: 16.79 - https://huggingface.co/patrickvonplaten/roberta2roberta-cnn_dailymail-fp16
- Roberta2Roberta (shared): 16.59 - https://huggingface.co/patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16
Note: The models above were trained without any hyper-parameter search and with fp16 precision. For more details, please refer to the respective model cards.
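To try one of the checkpoints listed above, it can be loaded directly with from_pretrained and used for generation. A minimal sketch for the Bert2Bert checkpoint (the article string is a placeholder, and it is assumed that a matching tokenizer can be loaded from the checkpoint repository):

```python
from transformers import AutoTokenizer, EncoderDecoderModel

ckpt = "patrickvonplaten/bert2bert-cnn_dailymail-fp16"
tokenizer = AutoTokenizer.from_pretrained(ckpt)  # assumes tokenizer files are shipped with the checkpoint
model = EncoderDecoderModel.from_pretrained(ckpt)

article = "..."  # placeholder for a CNN/Daily-Mail-style news article

inputs = tokenizer(article, truncation=True, max_length=512, return_tensors="pt")
output_ids = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```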
UPDATE:
Better models trained with the Seq2SeqTrainer and the code on the current master branch give the following results:
- BERT2BERT on CNN/Daily-Mail: 18.22 - https://huggingface.co/patrickvonplaten/bert2bert_cnn_daily_mail
- Roberta2Roberta (shared) on BBC/XSum: 16.89 - https://huggingface.co/patrickvonplaten/roberta_shared_bbc_xsum
Two notebooks showing how Encoder-Decoder models can be trained on the current master are also attached to the model cards; a rough outline of such a training setup is sketched below.
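The notebooks remain the authoritative reference; purely as orientation, a warm-started BERT2BERT model can be fine-tuned on CNN/Daily-Mail with the Seq2SeqTrainer roughly as follows. The checkpoint choice, sequence lengths, dataset slice, and training arguments below are illustrative assumptions, not the exact settings behind the results above:

```python
from datasets import load_dataset
from transformers import (
    BertTokenizerFast,
    DataCollatorForSeq2Seq,
    EncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Warm-start a BERT2BERT model (checkpoint choice is an assumption for illustration).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id

def preprocess(batch):
    # Articles become encoder inputs, highlights become decoder labels.
    inputs = tokenizer(batch["article"], truncation=True, max_length=512)
    labels = tokenizer(batch["highlights"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

# A small slice keeps the sketch fast to run.
train_data = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")
train_data = train_data.map(preprocess, batched=True, remove_columns=["article", "highlights", "id"])

training_args = Seq2SeqTrainingArguments(
    output_dir="bert2bert-cnn_dailymail",
    per_device_train_batch_size=4,
    predict_with_generate=True,
    fp16=True,  # the checkpoints above were trained with fp16
    logging_steps=100,
    save_steps=1000,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```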